Individuals with autism spectrum disorder are impaired in absolute but not relative pitch and duration matching in speech and song imitation

Individuals with autism spectrum disorder (ASD) often exhibit atypical imitation. However, few studies have identified clear quantitative characteristics of vocal imitation in ASD. This study investigated imitation of speech and song in English-speaking individuals with and without ASD and its modulation by age. Participants consisted of 25 autistic children and 19 autistic adults, who were compared to 25 children and 19 adults with typical development matched on age, gender, musical training, and cognitive abilities. The task required participants to imitate speech and song stimuli with varying pitch and duration patterns. Acoustic analyses of the imitation performance suggested that individuals with ASD were worse than controls on absolute pitch and duration matching for both speech and song imitation, although they performed as well as controls on relative pitch and duration matching. Furthermore, the two groups produced similar numbers of pitch contour, pitch interval, and time errors. Across both groups, sung pitch was imitated more accurately than spoken pitch, whereas spoken duration was imitated more accurately than sung duration. Children imitated spoken pitch more accurately than adults, whereas age showed no significant relationship to song imitation. These results reveal a vocal imitation deficit across speech and music domains in ASD that is specific to absolute pitch and duration matching. This finding provides evidence for shared mechanisms between speech and song imitation, which involves independent implementation of relative versus absolute features.


INTRODUCTION
Imitation is a fundamental skill that emerges early in typical human development (Meltzoff, 2017). It is essential for learning complex constructs, including language (McEwen et al., 2007; Rose et al., 2009) and social interaction (Kuhl, 2007; Masur, 2006; Vivanti & Hamilton, 2014). In particular, by imitating others or being imitated, individuals gradually become aware of the physical world, such as cause-effect relations (Meltzoff & Williamson, 2013); the mental states of other people, such as their intentions and feelings (Meltzoff & Keith Moore, 1994); and the sounds around them, such as languages (Charman et al., 2000; Young et al., 2011).
Imitation in individuals with autism spectrum disorder (ASD) is often described as atypical (Williams et al., 2001). Deficits in imitation skills in ASD have been reported for a variety of tasks, including action imitation, which involves using body and hands to act (Ham et al., 2011; Young et al., 2011); object-directed action imitation, where actions involve objects (Cossu et al., 2012); facial imitation (Bernier et al., 2007); and vocal imitation. Specifically, when individuals with ASD are instructed to imitate an action or utterance, they imitate with lower levels of accuracy and do so less frequently than typically developing (TD) counterparts (Edwards, 2014; Turan & Okcun Akcamus, 2012; Vivanti & Hamilton, 2014). Imitation deficits in ASD mainly manifest in high fidelity imitation of form, rather than in emulation of function or end points (Edwards, 2014). Functional magnetic resonance imaging studies suggest dysfunction of the mirror neuron system during action imitation in ASD (Yang & Hofmann, 2016).
Compared with other areas of imitation (e.g., action, object, and face), research on vocal imitation in ASD is relatively scarce and has only focused on either the speech or music domain. Although several studies have addressed vocal imitation of speech in ASD, results to date are mixed regarding whether and to what extent (pitch, duration, and/or the balance between the two) individuals with ASD show speech imitation deficits. Some differences across studies may be due to the use of acoustic analyses versus the use of perceptual ratings. For instance, based on ratings by speech and language therapists, children with ASD had impaired imitation of various prosodic forms, including affect, intonation, chunking, and focus (McCann & Peppé, 2003; Peppé et al., 2007, 2011). By contrast, acoustic analyses of pitch range showed no difference across groups for imitation of stress, despite the fact that ASD participants received lower perceptual ratings of accuracy than TD controls (Paul et al., 2008). Thus, Van Santen et al. (2010) called attention to the unreliability and bias of clinicians' perceptual ratings (not strictly blind to participants' diagnostic status) and advocated the advantages and objectivity of instrumental methods. However, studies employing acoustic measures to assess imitation performance also produced divergent findings, with one study showing a group difference in duration only but not in mean pitch (Diehl & Paul, 2012) and other studies reporting both pitch and duration differences between groups (Fosnot & Jun, 1999; Hubbard & Trauner, 2007).
In contrast to speech, music has been seen as an area of exceptional skills in ASD (Molnar-Szakacs & Heaton, 2012; Ouimet et al., 2012). However, only one study has examined music imitation in ASD, and the results suggested that children with ASD showed comparable or better performance than controls when imitating pitch, rhythm, and duration of musical tones based on independent observers' judgment (Applebaum et al., 1979). Thus, regarding vocal imitation, ASD seems associated with atypical speech imitation but normal to superior music imitation. Given that vocal imitation is crucial for language acquisition (Kuhl, 2000; Kuhl & Meltzoff, 1996) and successful imitation requires sensorimotor, cognitive, and social skills (Fridland & Moore, 2015; Heyes, 2001; Nguyen & Delvaux, 2015; Over & Carpenter, 2013; Pagliarini et al., 2020), a potential impairment in vocal imitation may be related to landmark deficits of ASD including social and communicative difficulties (American Psychiatric Association, 2013; Diehl et al., 2015; McCann & Peppé, 2003). Previous studies have suggested that musical training benefits speech processing (Patel, 2011, 2012) and similar acoustic cues are used in emotional communication in music and speech (Juslin & Laukka, 2003). In addition, vocal imitation mechanisms are likely shared between speech and song production in adults with typical development (Wisniewski et al., 2013). Thus, the intimate link between music and speech raises the question of whether vocal imitation impairment in ASD is indeed domain specific, especially when there has only been one study examining music imitation in ASD (Applebaum et al., 1979).
The domain specificity or generality of vocal imitation impairment in ASD is particularly relevant to a longstanding debate about whether speech and music share the same underlying processing systems (Albouy et al., 2020; Norman-Haignere et al., 2015; Zatorre & Gandour, 2008). The modular or domain-specific framework proposes that speech and music may involve distinct modules or mechanisms that deal with a particular aspect of the input and its output representation, either exclusively or more effectively than any other mechanisms (Fodor, 1983, 2001; Peretz, 2009; Peretz et al., 2015; Peretz & Coltheart, 2003; Peretz & Zatorre, 2005). While speaking and singing involve multiple processing components, musical abilities depend, in part, on modular processes such as tonal encoding of pitch, which is music-specific and independent of spoken pitch processing (Peretz, 2009; Peretz & Coltheart, 2003). In contrast to this view, others have suggested that speech and music systems may not be entirely modular or independent (Kunert & Slevc, 2015; Patel, 2013). Rather, there are shared or domain-general mechanisms underlying the processing of information across both domains (Koelsch, 2011; Koelsch & Siebel, 2005; Patel, 2008; Sammler et al., 2009). Numerous studies have provided evidence in support of either the domain-specific or domain-general view (Kunert & Slevc, 2015; Peretz et al., 2015). In addition to comparing music with language processing in typical development (Slevc et al., 2009; Slevc & Miyake, 2006), neurodevelopmental disorders such as congenital amusia (Liu et al., 2010, 2013) and ASD (DePriest et al., 2017; Jiang et al., 2015) could offer special insight into this debate, particularly regarding whether deficits are only present in one domain (e.g., music), but not in the other (e.g., speech).
Specifically, as a functional output representation, vocal imitation of speech and song could inform the domain-specific versus domain-general debate from a production perspective (Peretz, 2009). Using matched speech and song stimuli, Liu et al. (2013) directly compared speech with music imitation in individuals with and without congenital amusia, a disorder of music processing (Ayotte et al., 2002). Individuals with congenital amusia demonstrated impaired pitch and duration matching of speech and song in terms of both absolute and relative measures. These findings suggest that vocal imitation mechanisms are likely shared between speech and song production even in congenital amusia (Liu et al., 2013). Although prior findings on vocal imitation in individuals with ASD seem to favor the domain-specific model, there have been too few published studies, especially on music imitation in ASD, to draw valid conclusions that could inform the theoretical debate on music and language processing in this clinical population. Also, no studies have directly compared imitation abilities in speech versus music in ASD using matched linguistic and musical tasks to address the question of whether impaired speech imitation but spared/enhanced music imitation would be present in the same sample of participants. Furthermore, the absolute measures on pitch and duration matching used in Liu et al. (2013) required higher fidelity imitation than the relative measures. It remains to be determined whether individuals with ASD would show worse performance on absolute feature matching than relative feature matching during vocal imitation, similar to other areas of imitation in ASD (Edwards, 2014).
Studies of speech imitation in TD children and adults suggest that speech imitation ability is influenced by age (Kent & Forner, 1980; Loeb & Allen, 1993; Snow, 1998). Specifically, young children (aged 4) tended to imitate speech segments with longer duration and greater variability than older children (aged 6 and 12) and adults (Kent & Forner, 1980). While 3- and 4-year-olds showed more difficulty in imitating rising intonation contours in questions than falling intonation contours in statements, 5-year-olds were able to imitate both types of contours (Loeb & Allen, 1993; Snow, 1998). Studies of music imitation have examined pitch matching of tones or melodies in children (Cooper, 1995; Geringer, 1983) and adults (Amir et al., 2003; Pfordresher & Brown, 2007). Among children, pitch matching accuracy increases with age, with fourth-graders (9-10 years old) performing significantly better than third-graders (8-9 years old) (Cooper, 1995). While more than half of fourth-graders could match pitch within 50 cents (0.5 semitones), preschool children (4-5 years old) produced a median deviation of 2.5 semitones (Geringer, 1983). Results on tempo matching or rhythm reproduction during music imitation suggest that, at 6 years, both musicians and non-musicians were able to reproduce rhythmic patterns embedded in a string of syllables (Gérard & Auxiette, 1992; Reifinger, 2006), and rhythmic response ability increased with age from Grade 1 to Grade 3 students (Schleuter & Schleuter, 1985). In addition, adults outperformed 5- to 7-year-old children on rhythm repetition, melody repetition, prosody repetition, as well as a range of other language and music tasks (Cohrdes et al., 2016). Thus, similar to speech imitation, music imitation ability is also influenced by age in typical development.
In ASD, the evidence from studies on non-vocal imitation (e.g., action, object, and face) suggests that, although imitative abilities increase over time, impairments in imitation continue throughout the lifespan (Biscaldi et al., 2014; Vivanti & Hamilton, 2014; Young et al., 2011). However, how age influences speech and music imitation in ASD has not been systematically studied.
In the current study, we examined vocal imitation abilities in children and adults with and without ASD using matched speech and song stimuli, addressing three research questions: (1) Do imitation abilities of individuals with ASD differ from controls in terms of pitch and duration matching across speech and music domains? (2) Do individuals with ASD differ from controls with respect to relative and absolute feature matching in vocal imitation? (3) Do vocal imitation abilities in ASD and TD vary with age? Based on previous findings, we hypothesized that: (1) Participants with ASD would show impaired pitch and duration imitation in speech but not in song compared to controls; (2) Participants with ASD would show poorer performance on absolute feature matching than on relative feature matching as compared to controls; and (3) Across both groups, the adult cohort would perform better than the child cohort overall.

METHOD

Participants
A group of 44 individuals with ASD and 44 matched controls were recruited via a variety of methods including email lists, local social media advertisements, and local experimental participant databases. All were native British English speakers with no speech or hearing problems, and reported no history of other neurological or psychiatric disorders. Participants in the ASD group had a formal diagnosis of ASD by clinicians. Participants in the control group were included using the cut-off scores of 32 (adults), 30 (adolescents) or 76 (children) on the Autism-Spectrum Quotient (AQ) (Auyeung et al., 2008; Baron-Cohen et al., 2001). All participants had normal hearing in both ears, with pure-tone air conduction thresholds of 25 dB HL or better at frequencies of 0.5, 1, 2, and 4 kHz. Participants' nonverbal IQ was estimated using the Raven's Standard Progressive Matrices Test (Raven et al., 1998), and verbal IQ was estimated by the Receptive One Word Picture Vocabulary Test IV (ROWPVT-IV) (Martin & Brownell, 2011). The Corsi block-tapping task was used to assess participants' nonverbal short-term memory span (Kessels et al., 2000), and the forward digit span task was used to assess verbal short-term memory (Wechsler, 2003). Participants were further divided into two age cohorts, children (<16) and adults (≥16), based on the age cut-off of 16 years, following the definition of adults in the AQ (Baron-Cohen et al., 2001). The reason that we used a two-way rather than a three-way split of age cohorts (children, adolescents, and adults) was to ensure that there were enough participants in each cohort. The age range of the child cohort was between 7.39 and 15.75 years and that of the adult cohort was between 16 and 56.75 years.
Children's music perception skills were assessed using the Montreal Battery of Evaluation of Musical Abilities (MBEMA), which consists of five subtests with 20 trials each measuring the perception of scale, contour, interval, rhythm and recognition memory of musical melodies (Peretz et al., 2013). Adults were assessed using the Montreal Battery of Evaluation of Amusia (MBEA), which contains six subtests with 30 trials each measuring the perception of scale, contour, interval, rhythm, meter, and recognition memory of musical melodies. All participants also completed a questionnaire about their musical, language, and medical background, in which they were asked to report whether they possess absolute pitch or perfect pitch, the ability to identify a musical note without a reference tone (Deutsch, 2013). As can be seen in Table 1, the ASD and control groups were comparable on all background measures. The study was approved by the University of Reading Research Ethics Committee. Written informed consent/assent was obtained from the participants and/or their parents prior to the experiment.

Stimuli
The target stimuli were 12 sentences either spoken or sung as statements or questions from Mantell and Pfordresher (2013), yielding 48 sentences with three to five syllables each. The speech stimuli were naturally spoken, and the pitch-time trajectory did not correspond to any diatonic scales. In order to create contour variation in the sequences, statements were produced with a falling contour and questions with a rising contour. The song stimuli comprised pitches from a major diatonic scale that approximated the global melodic contours of the speech stimuli. Each sung syllable had a roughly identical duration so as to invoke a metrical beat, resulting in the song stimuli being longer than the speech stimuli. Three versions of the speech/song stimuli were used for different age and gender groups. The adult male and female versions were taken from Mantell and Pfordresher (2013) and used for male/female participants ≥12 years old. For child participants <12 years old, a child version was created by a child (female, 11-year-old, with 5 years of musical training) imitating the female version but in her own pitch range (see Figure 1; for more details, see Table S1).

Procedure
The presentation of the target stimuli and the recording of the imitations were both done using Praat (Boersma & Weenink, 2001). Participants were seated in a soundproof booth and presented with four practice trials (with items different from those in experimental trials: two speech vs. two song) to familiarize themselves with the task and the recording environment. Following the practice session, participants were presented with each of the 48 speech/song sentences one at a time in a pseudorandom order to ensure that different experimental conditions would alternate in an unpredictable manner and that long runs of the same condition (possible with true randomization) would not occur. Participants were instructed to imitate exactly the pitch and timing patterns of the sentences to the best of their ability, while their voices were recorded. Each sentence was played once and only replayed when participants failed to catch the words, and not when they wanted to listen to it again so they could imitate it better.
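The pseudorandom ordering described above can be illustrated with a short sketch. It is not the authors' procedure: the limit on run length (`max_run=3`), the choice of speech versus song as the condition that must alternate, and the rejection-sampling approach are all assumptions made for illustration, since the text only states that long runs of the same condition were avoided.

```python
import random
from itertools import groupby

def longest_run(labels):
    """Length of the longest run of identical consecutive labels."""
    return max((sum(1 for _ in group) for _, group in groupby(labels)), default=0)

def pseudorandom_order(trials, key, max_run=3, seed=None, max_tries=10_000):
    """Shuffle trials, rejecting any order whose condition runs exceed max_run.

    max_run is a hypothetical limit; the source does not report one.
    """
    rng = random.Random(seed)
    order = list(trials)
    for _ in range(max_tries):
        rng.shuffle(order)
        if longest_run([key(t) for t in order]) <= max_run:
            return order
    raise RuntimeError("no valid order found; relax max_run or raise max_tries")

# 12 sentences x {speech, song} x {statement, question} = 48 trials
trials = [
    (sentence, domain, mode)
    for sentence in range(1, 13)
    for domain in ("speech", "song")
    for mode in ("statement", "question")
]
order = pseudorandom_order(trials, key=lambda t: t[1], max_run=3, seed=1)
```

Rejection sampling keeps every admissible order equally likely, at the cost of a few dozen discarded shuffles on average for a list of this size.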

Data analysis
Recordings were analyzed in Praat using ProsodyPro (Xu, 2013) to extract the pitch and duration of each syllable rhyme. The rhyme was defined as the vowel portion of the syllable plus any final voiced consonant (e.g., car, book), which was done by the first author (a phonetician). Octave errors in pitch imitation were corrected, that is, when imitated pitch was more than 6 semitones (half octave) apart from the model pitch, the value was adjusted as 12imitated pitch. In total, less than 3% of the data samples needed to be adjusted and most of these errors were caused by creaky voices, resulting in decreased fundamental frequency, F0 (Johnson, 2011). For accurate acoustic analysis of the data, we used ProsodyPro to manually add these missed vocal pulse marks for F0 based on the waveforms and spectrograms, to avoid having erroneous outliers misleading imitation results. Trials were not excluded when participants repeated the sentences slightly incorrectly but with the correct rhyme, for example, substituting "he" for "she" or "brought" for "bought." In the literature, pitch accuracy in singing and imitation has been analyzed using a variety of measures, such as using median F0 (Dalla Bella et al., 2007; Liu et al., 2013) or mean F0 (Hutchins & Peretz, 2012) of the vowel or vocalic group to indicate pitch height of each note/syllable, or calculating mean absolute pitch error and pitch correlation across the entire pitch trajectories of the model and imitated sequences. For timing accuracy, either subjective ratings, for example, 0 = "incorrect," 0.5 = "partly correct," and 1 = "correct" (Cohrdes et al., 2016), or objective acoustic analyses, for example, number of time errors as determined by a 25% time deviation (Dalla Bella et al., 2007; Liu et al., 2013; Tremblay-Champoux et al., 2010) have been used. The pros and cons of these different methods and measures have been discussed (Dalla Bella, 2015).
TABLE 1 Characteristics of the ASD (n = 44) and control groups (n = 44)
Since the ability to imitate/produce absolute versus relative features and pitch versus timing variables can dissociate in different "phenotypes" of poor singing (Berkowska & Dalla Bella, 2013), it is recommended that these dimensions be examined separately (Dalla Bella, 2015). Compared to mean F0, median F0 is a preferable measure of pitch height, since it is less affected by extreme or erroneous variation of F0 due to creaky voice (Dalla Bella, 2015). In contrast to the whole trajectory analysis of each sequence, measuring the median F0 of each note/syllable rhyme (or vowel group) makes the calculation of pitch interval and pitch contour (two critical components in memory for melodies) between consecutive notes/syllables possible (Dowling & Fujitani, 1971). Most importantly, similar to music, there are pitch targets in speech across tone and intonation languages, such as high, low, rising, and falling, and they are realized based on linguistic functions and articulatory constraints (Xu, 2005; Xu & Prom-on, 2014; Xu & Wang, 2001). With a tonal perception model, speech prosody can be transcribed using a stylization of pitch levels and movements coupled with vocalic segments (Mertens, 2004), enabling the comparison of spoken and musical rhythm and melody (Patel et al., 2006). Thus, taking a comparative approach to studying music and language (Patel et al., 2006), the following measures were used.
The absolute pitch deviation (in cents): Median F0 was extracted from each syllable rhyme and then subtracted from that of their matched model to find the pitch deviation (in absolute value) for each imitated rhyme. The deviations were averaged over all syllables/notes in each utterance/melody, in order to control for the nonindependence of the data points within each utterance/melody (Mcdonald, 2014). The bigger the value, the less accurate the imitation in terms of absolute pitch matching.
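As a minimal sketch of the absolute pitch deviation, the following assumes that median F0 values are available in Hz and converts each imitated/model pair to cents with the standard relation cents = 1200 · log2(f1/f2). The function names are hypothetical, and the octave correction described earlier is not included.

```python
import math

def cents_between(f0_hz, ref_hz):
    """Signed distance from ref_hz to f0_hz in cents (1200 cents = 1 octave)."""
    return 1200.0 * math.log2(f0_hz / ref_hz)

def absolute_pitch_deviation(imitated_hz, model_hz):
    """Mean unsigned deviation (cents) between matched rhymes of one utterance."""
    if len(imitated_hz) != len(model_hz) or not imitated_hz:
        raise ValueError("need equally long, non-empty F0 sequences")
    deviations = [abs(cents_between(i, m)) for i, m in zip(imitated_hz, model_hz)]
    return sum(deviations) / len(deviations)

# One rhyme on target, one an octave above, one an octave below the model
example = absolute_pitch_deviation([220.0, 440.0, 110.0], [220.0, 220.0, 220.0])
```

In this toy example the per-rhyme deviations are 0, 1200, and 1200 cents, so the utterance-level value is their mean.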
The relative pitch deviation (in cents): Pitch interval was calculated as the absolute difference in median F0 between two consecutive syllables/notes, and then subtracted from their matched model speaker's pitch interval (in absolute value). The deviations were averaged over all intervals in each utterance/melody and the bigger the value, the less accurate the imitation in terms of relative pitch matching.
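The relative measure can be sketched in the same way. Following the wording above, intervals are treated here as unsigned magnitudes between consecutive median F0 values; whether the original analysis used signed intervals is not certain from the text, so this is an assumption, as are the Hz inputs and the function names.

```python
import math

def interval_sizes_cents(f0_hz):
    """Unsigned interval sizes (cents) between consecutive syllables/notes."""
    return [abs(1200.0 * math.log2(b / a)) for a, b in zip(f0_hz, f0_hz[1:])]

def relative_pitch_deviation(imitated_hz, model_hz):
    """Mean unsigned difference (cents) between imitated and model intervals."""
    imitated = interval_sizes_cents(imitated_hz)
    model = interval_sizes_cents(model_hz)
    if len(imitated) != len(model) or not imitated:
        raise ValueError("need equally long sequences with at least two values")
    diffs = [abs(i - m) for i, m in zip(imitated, model)]
    return sum(diffs) / len(diffs)

# A transposed imitation: every pitch is off, but every interval is preserved
transposed = relative_pitch_deviation([300.0, 600.0, 600.0], [200.0, 400.0, 400.0])
```

The transposed example shows why the two measures can dissociate: each imitated pitch is about 702 cents above the model, yet the relative deviation is zero because the intervals match.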
FIGURE 1 The pitch-time trajectory of the sentence "They went home" under different conditions by child/female/male target speakers
The number of contour errors: Contour errors were defined as imitated pitch intervals that differed from the corresponding target pitch intervals in regard to pitch directions (up, down, or level). Pitch direction was considered to be up or down if the difference in pitch interval was higher or lower by 50 cents (100 cents = 1 semitone) or more; otherwise (the difference was within 50 cents), the pitch intervals were considered to form a level/flat pitch direction. The number of contour errors was summed over each utterance/melody.
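A possible implementation of the contour-error count is sketched below. It assumes Hz inputs and signed intervals in cents, and classifies an interval as up or down when its magnitude is at least 50 cents and as level otherwise; the function names are hypothetical.

```python
import math

def signed_intervals_cents(f0_hz):
    """Signed intervals (cents) between consecutive syllables/notes."""
    return [1200.0 * math.log2(b / a) for a, b in zip(f0_hz, f0_hz[1:])]

def direction(interval_cents, threshold=50.0):
    """Return 1 (up), -1 (down), or 0 (level) using the 50-cent criterion."""
    if interval_cents >= threshold:
        return 1
    if interval_cents <= -threshold:
        return -1
    return 0

def count_contour_errors(imitated_hz, model_hz, threshold=50.0):
    """Number of intervals whose direction differs from the model's."""
    imitated = signed_intervals_cents(imitated_hz)
    model = signed_intervals_cents(model_hz)
    if len(imitated) != len(model):
        raise ValueError("imitation and model must have the same length")
    return sum(
        direction(i, threshold) != direction(m, threshold)
        for i, m in zip(imitated, model)
    )

# Model rises then falls; the imitation stays nearly level, then falls
errors = count_contour_errors([200.0, 202.0, 180.0], [200.0, 250.0, 200.0])
```

Here the first imitated interval is about 17 cents (level) against a rising model interval, which counts as one error, while the second interval falls in both and counts as correct.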
The number of pitch interval errors: Pitch interval errors were defined as imitated pitch intervals that were larger or smaller than the corresponding target pitch intervals by 100 cents without considering the pitch direction. Specifically, imitated and target pitch intervals were compared using absolute values. The number of pitch interval errors was summed over each utterance/melody.
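The interval-error count can be sketched as follows, comparing unsigned interval sizes as described. Treating a deviation of at least 100 cents as an error is an assumption about the boundary case, and the Hz inputs and function names are illustrative.

```python
import math

def interval_sizes_cents(f0_hz):
    """Unsigned interval sizes (cents) between consecutive syllables/notes."""
    return [abs(1200.0 * math.log2(b / a)) for a, b in zip(f0_hz, f0_hz[1:])]

def count_interval_errors(imitated_hz, model_hz, threshold=100.0):
    """Number of intervals whose size differs from the model's by >= threshold."""
    imitated = interval_sizes_cents(imitated_hz)
    model = interval_sizes_cents(model_hz)
    if len(imitated) != len(model):
        raise ValueError("imitation and model must have the same length")
    return sum(abs(i - m) >= threshold for i, m in zip(imitated, model))

# Model leaps an octave; the imitation reaches only a fifth (about 702 cents)
too_small = count_interval_errors([200.0, 300.0], [200.0, 400.0])
# Same size, opposite direction: not an interval error under this definition
reversed_leap = count_interval_errors([400.0, 200.0], [200.0, 400.0])
```

The second example would be a contour error but not an interval error, which is why the two counts are reported separately.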
The absolute duration difference (in milliseconds): Duration was extracted from each syllable rhyme and then subtracted from their matched model speaker's production to find the absolute difference for each rhyme. The differences were averaged over all rhymes in each utterance/melody and the larger the value, the less accurate the imitation in terms of absolute duration matching.
The relative duration difference (in milliseconds): Interonset interval (IOI) was calculated as the difference between the onsets of two consecutive syllables/notes, and then subtracted from their matched model speaker's IOI (in absolute value). The differences were averaged over all IOIs in each utterance/melody and the larger the value, the less accurate the imitation in terms of relative duration matching.
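The two duration measures above can be sketched together. The example assumes rhyme durations and onset times in milliseconds as inputs; the function names and the toy values are illustrative only.

```python
def mean_abs_diff(a, b):
    """Mean unsigned difference between two equally long, non-empty sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need equally long, non-empty sequences")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def absolute_duration_difference(imitated_ms, model_ms):
    """Mean unsigned difference (ms) between matched rhyme durations."""
    return mean_abs_diff(imitated_ms, model_ms)

def inter_onset_intervals(onsets_ms):
    """Time (ms) from each onset to the next."""
    return [b - a for a, b in zip(onsets_ms, onsets_ms[1:])]

def relative_duration_difference(imitated_onsets_ms, model_onsets_ms):
    """Mean unsigned difference (ms) between matched inter-onset intervals."""
    return mean_abs_diff(
        inter_onset_intervals(imitated_onsets_ms),
        inter_onset_intervals(model_onsets_ms),
    )

abs_diff = absolute_duration_difference([120.0, 180.0, 150.0], [100.0, 200.0, 150.0])
rel_diff = relative_duration_difference([50.0, 350.0, 700.0], [0.0, 300.0, 600.0])
```

Because the relative measure depends only on spacing between onsets, an imitation that starts late but keeps the model's timing pattern is not penalized for the delay.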
The number of time errors: Time errors were defined as imitated syllables/notes that were more than 25% longer or shorter than the corresponding target syllables/notes (Dalla Bella et al., 2007; Prince & Pfordresher, 2012). This measure takes into account that in Western tonal music, event durations constitute simple integer ratio relationships, for example, sixteenth notes (1/4 a beat), eighth notes (1/2 a beat), quarter notes (1 beat), and so forth, and counting time errors this way will capture the violation to the time signature (Drake & Palmer, 2000). Similarly, in stress-timed languages such as English, speech rhythm can also be measured in relative terms, making the comparison of spoken and musical rhythm possible (Patel et al., 2006; Patel & Daniele, 2003). The number of time errors was summed over each utterance/melody.
Statistical analyses were conducted using RStudio (RStudio Team, 2020). We performed linear mixed effects analysis using the lme4 package (Bates et al., 2015; Brauer & Curtin, 2018) with the abovementioned pitch and time variables as the dependent variable and Diagnostic Group (effect-coded: Control vs. ASD), Age cohorts (effect-coded: Child vs. Adult), and Condition (effect-coded: Speech vs. Music) as well as all possible interactions as fixed effects. All models were fit using the maximal random effects structure that converged with two random factors (subject and file) (Barr, 2013; Barr et al., 2013). When the maximal model failed to converge, the random correlations were removed first. If the model still failed to converge, the random effect with the least variance was iteratively removed until the model converged. Statistical significance of the fixed effects was estimated using the summary() function of the lmerTest package (Kuznetsova et al., 2017), which provided p values for the corresponding t tests. Subsequent post-hoc comparisons, if any, were conducted using the emmeans package (Lenth et al., 2018).
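For completeness, the time-error criterion defined above (a duration more than 25% longer or shorter than the target) can be sketched as follows. Durations in milliseconds and the function name are assumptions made for illustration.

```python
def count_time_errors(imitated_ms, model_ms, tolerance=0.25):
    """Number of syllables/notes deviating from the model by more than 25%."""
    if len(imitated_ms) != len(model_ms):
        raise ValueError("imitation and model must have the same length")
    return sum(
        abs(imitated - model) / model > tolerance
        for imitated, model in zip(imitated_ms, model_ms)
    )

# Deviations of 30%, 20%, and 0% relative to the model durations
time_errors = count_time_errors([130.0, 240.0, 400.0], [100.0, 200.0, 400.0])
```

Only the first syllable exceeds the tolerance, so the count for this toy utterance is one.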
Additionally, to evaluate the performance of those participants who self-reported possessing absolute pitch, we closely inspected the results of these participants (Table 3). Given that the ASD group showed impaired imitation of absolute pitch, we took the values from the control group as the "standard" (M(SD) = 124.48(97.45), Range: 63.36-298.26) and only two of them performed better than the average level.
[...] Table 2).
[...] Table 2, the linear mixed-effects model revealed a significant main effect of Condition, as both groups showed fewer pitch interval errors in the Music condition (M(SD) = 22.56(13.46)) than in the Speech condition (M(SD) = 39.28(11.2)). The interaction between Age and Condition was also significant, although post-hoc analyses with Bonferroni correction revealed no significant difference between the child cohorts and adult cohorts in either condition (Speech: t(138) = -1.54, p = 0.13; Music: t(138) = 1.92, p = 0.06). No other main effects or interactions were significant.
[...] Table 4, a significant main effect of Group, as the ASD group (M(SD) = 57.92(40.09)) produced significantly larger absolute duration differences than did the Control group (M(SD) = 51.48(32.55)). The main effect of Condition was also significant, with both groups showing larger absolute duration differences in the Music condition (M(SD) = 71.73(41.31)) than in the Speech condition (M(SD) = 37.67(20.01)). No other main effects or interactions were significant.
[...] Table 4).
FIGURE 2 Boxplots of pitch-related measures for the ASD and control groups. (a) the absolute pitch deviation; (b) the relative pitch deviation; (c) the number of contour errors; (d) the number of pitch interval errors (asterisks represent p-values between variables with *p < 0.05, **p < 0.01 and ***p < 0.001). ASD, autism spectrum disorder


DISCUSSION
The present study investigated imitation of speech and song in English-speaking individuals with and without ASD and its modulation by age using absolute and relative pitch and duration measures. The main results showed that individuals with ASD were worse than controls on absolute pitch and duration matching, while performing as well as controls on relative pitch and duration matching in both speech and song imitation. In addition, the two groups produced similar numbers of pitch contour errors, pitch interval errors, and time errors. Furthermore, like the controls, individuals with ASD imitated sung pitch more accurately than spoken pitch, whereas spoken duration was imitated more accurately than sung duration. Across both groups, children tended to imitate pitch more accurately than adults when it came to speech stimuli rather than song stimuli, whereas adults made fewer time errors than did children in both stimulus types.
In terms of absolute feature matching during vocal imitation, we discovered impaired performance in the ASD group for both pitch and duration across both speech and song conditions as compared to the control group. This finding is in line with previous results showing impaired imitation of form in ASD (Edwards, 2014). A few previous studies also showed impaired pitch and duration imitation for speech in ASD (Fosnot & Jun, 1999; Hubbard & Trauner, 2007). However, other studies indicated that speech imitation deficits in ASD only manifested in duration (Diehl & Paul, 2012; Paul et al., 2008). The discrepancy may be related to the different methods used to measure imitation performance across the studies. While we compared group differences in imitation by measuring how well participants in each group matched the pitch and duration features of the model utterances, previous studies ignored the model but compared the pitch and duration patterns of the produced utterances across groups (Diehl & Paul, 2012; Paul et al., 2008). Thus, as in previous vocal imitation studies (Liu et al., 2013), we measured imitation abilities by comparing acoustic parameters between the model and imitated utterances, and the smaller the difference, the more accurate the imitation. Using this method, we were able to reveal differences in absolute feature matching during imitation between groups. However, previous studies only showed the differences in characteristics between the produced utterances of the two groups (Diehl & Paul, 2012; Paul et al., 2008), thus measuring speech production, rather than imitation accuracy.
FIGURE 3 Boxplots of duration-related measures for the ASD and control groups. (a) the absolute duration difference; (b) the relative duration difference; (c) the number of time errors (asterisks represent p-values between variables with *p < 0.05, **p < 0.01 and ***p < 0.001). ASD, autism spectrum disorder
In contrast to the intact musical imitation abilities reported in a previous study on children with ASD (Applebaum et al., 1979), our finding demonstrated that both children and adults with ASD were impaired in absolute pitch and duration matching for song imitation. One explanation for this discrepancy may be related to how the accuracy of imitation was calculated. Specifically, Applebaum et al. (1979) relied on subjective perceptual ratings of imitation accuracy by two independent observers, whereas the current study employed objective acoustic analyses. A second possible explanation relates to the difference in sample size. While 88 participants (44 per group) were involved in the present study, only six individuals participated in Applebaum et al.'s (1979) study (three per group). Thus, the current results may be more reliable given the objective acoustic analyses and a larger sample size.
TABLE 4 Results from the linear mixed-effects model for the duration-related measures. Columns: Estimate, Std. error, df, t, p. Models reported include the absolute duration differences model. Note: *p < 0.05, **p < 0.01, and ***p < 0.001.
Despite impaired absolute pitch and duration matching, individuals with ASD showed comparable performance to controls on relative pitch and duration matching, as well as on other measures of relative-feature matching (e.g., number of pitch contour, pitch interval, and time errors). Our results are consistent with previous findings on poor singers. For instance, Dalla Bella and colleagues examined occasional singers' pitch and duration accuracy in terms of both absolute and relative features when spontaneously producing well-known melodies, as well as when imitating these melodies with a metronome at a slower tempo. They found that poor singers performed less accurately in the absolute measures than the relative measures and suggested that the production of relative and absolute pitch and time features may be independent in the music domain. Our results extend those findings, showing that the dissociation between relative and absolute pitch and duration matching also holds for impaired vocal imitation in ASD and that the dissociation exists not only in music but also in speech. However, to the best of our knowledge, no previous studies have examined relative versus absolute feature matching, or relative feature matching alone, in either speech or music imitation in ASD, which makes it difficult to find evidence to explain why individuals with ASD showed preserved relative but impaired absolute pitch and duration matching during vocal imitation. Below, we propose two possible explanations for the divergent results of absolute versus relative feature matching in ASD, both of which require further investigation.
First, one possibility might relate to the differential requirement for fidelity of imitation between absolute and relative features. There has been extensive evidence from non-vocal studies (e.g., action, objects, and face) suggesting that individuals with ASD manifest impaired imitation ability in tasks that require high fidelity imitation, such as reproducing precisely both the form and the end result of a model (Edwards, 2014). However, deficits are generally not observed in the ASD group on tasks requiring lower fidelity, such as emulation, which only requires reproducing the final result or goal without regard to the form used to achieve it (Hamilton, 2008; Edwards, 2014). In our study, the absolute measures required higher fidelity imitation compared to the relative measures. In particular, absolute measures examined the exact matching of pitch and duration features for each syllable/note, while relative measures assessed the matching of the relative pitch and timing relationship between two consecutive syllables/notes.
Thus, our current results indicate for the first time that, consistent with non-vocal imitation studies (Hamilton, 2008; Edwards, 2014), individuals with ASD show impaired vocal imitation ability in tasks requiring high fidelity (i.e., absolute feature matching), but not in tasks requiring lower fidelity (i.e., relative feature matching).
Second, it is possible that the imitation mode (relative vs. absolute) that participants adopted during vocal imitation may account for the dissociation. Specifically, evidence from perception research in TD indicates that, as children mature from 3 to 6 years, there is a general developmental shift from an absolute to a relative mode in pitch perception (Crozier, 1997; Saffran & Griepentrog, 2001; Sergeant & Roche, 1973; Takeuchi & Hulse, 1993). Studies also found that while adults relied primarily on relative pitch cues, they were able to access absolute cues under certain conditions (Saffran & Griepentrog, 2001), and both children and adults demonstrated absolute memory of familiar melodies (Levitin, 1994; Schellenberg & Trehub, 2003). Taking these findings together, it is possible that different participants may depend on different perception modes when imitating speech and song. The two groups may not have differed significantly in relative pitch and duration matching because all participants attended to the relative cues. While controls also accessed absolute cues during the process, participants with ASD did not or were less capable of doing so. Future studies are required to test this possibility by examining the relationship between perception and production during vocal imitation.
When using acoustically matched speech and song stimuli and testing the same sample of participants, we observed impairments (i.e., absolute pitch and duration matching) as well as preserved skills (i.e., relative pitch and duration matching) in ASD not only in speech but also in music. 
Hence, compatible with the findings on vocal imitation in people with typical development and those with congenital amusia (Liu et al., 2013), vocal imitation also appears to involve domain-general mechanisms in individuals with ASD. These findings provide support for using music therapy to improve speech for those individuals with ASD who manifest deficits in language (James et al., 2015). In addition, successful imitation requires sensorimotor, cognitive, and social skills (Fridland & Moore, 2015; Heyes, 2001; Nguyen & Delvaux, 2015; Over & Carpenter, 2013; Pagliarini et al., 2020). Thus, the benefit of music imitation may extend to improving cognitive and social skills in ASD (Boster et al., 2020).
It has been reported that absolute pitch (AP) ability is more common among individuals with ASD than in nonclinical populations (Heaton et al., 1998;Mottron et al., 1999;Stanutz et al., 2014). However, the present imitative results were not in line with these findings. Rather, we found that individuals with ASD showed impaired absolute pitch and duration matching. While we did not test our participants' receptive AP ability in the current study, we did ask whether they have absolute pitch (or perfect pitch, the ability to identify a musical note without a reference tone) in a questionnaire. According to the self-reports, two children with ASD (out of 25) and two adults with ASD (out of 19), as well as two control children (out of 25) possessed AP. However, they did not perform exceptionally when imitating absolute pitch, which suggests that receptive AP may not transfer to expressive AP in imitation. This finding is consistent with the dual-route model, which posits that vocal stimuli are processed for motor-relevant features and conscious, symbolic representations along two different, independent pathways (Hutchins & Moreno, 2013). Thus, vocal perception and production abilities could be uncorrelated, and each can be intact or impaired without affecting the other (Griffiths, 2008;Hutchins & Moreno, 2013;Loui, 2015). Notably, our findings are based on self-reports rather than experimental testing of AP. Further studies are needed to clarify the nature of receptive and productive AP in ASD.
Moreover, the current study examined whether speech and song imitation abilities vary with age in ASD and controls. Across both groups, adults made fewer time errors than children in both speech and song imitation. Time errors were defined as deviations from the target duration by more than 25%, and this was the only timing measure on which the two age cohorts differed significantly; they did not differ on the other timing measures (e.g., absolute and relative duration matching). These results suggest that while both children and adults can imitate the duration of speech and song segments comparably, children may have greater duration variability than adults when errors are measured relative to the duration of each segment. The findings are in agreement with previous research indicating that there is a developmental decrease in duration variability (Kent & Forner, 1980; Munson, 2004; Smith, 1978). Indeed, children possess less refined neuromotor capabilities than adults (Smith, 1978), and they are unable to exert adult-like control of speech production mechanisms. Hence, children's output reflects greater variability of phonetic segments compared to adults (Kent & Forner, 1980; Koenig et al., 2008; Munson, 2004; Smith et al., 1996).
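The time-error criterion described above can be sketched in a few lines of Python. This is a hedged illustration rather than the scoring procedure used in the study: the function name, the strict "greater than" boundary at the 25% tolerance, and all duration values are assumptions.

```python
def count_time_errors(model_durations, imitated_durations, tolerance=0.25):
    """Count segments whose imitated duration is off by more than tolerance x target."""
    if len(model_durations) != len(imitated_durations):
        raise ValueError("model and imitation must have the same number of segments")
    return sum(
        1
        for target, produced in zip(model_durations, imitated_durations)
        if abs(produced - target) > tolerance * target
    )


# Hypothetical target durations in milliseconds (not study data):
# the song targets are twice as long as the speech targets.
speech_targets_ms = [200, 180, 220]
song_targets_ms = [400, 360, 440]

# Both imitations lengthen every segment by the same 60 ms.
speech_imitation_ms = [t + 60 for t in speech_targets_ms]
song_imitation_ms = [t + 60 for t in song_targets_ms]

speech_errors = count_time_errors(speech_targets_ms, speech_imitation_ms)
song_errors = count_time_errors(song_targets_ms, song_imitation_ms)
print(speech_errors, song_errors)
```

Because the tolerance scales with the target, the same 60 ms deviation exceeds the criterion for every short segment in this example and for none of the long ones. A proportional criterion of this kind is therefore stricter in absolute terms for shorter targets.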
In contrast to what was observed in timing matching, across both groups, children tended to imitate absolute and relative pitch more accurately than adults for speech stimuli but not for song stimuli. This result may be due to children attending to speech pitches more readily than adults. Speech imitation is based on intentional understanding (Over & Gattis, 2010). Individuals thus tend to imitate the functional goal (e.g., statements with falling pitch contours vs. questions with rising pitch contours) rather than copying the exact form of the utterances (Liu et al., 2010, 2013). In the present study, this tendency appeared more pronounced in adults than in children. Indeed, we did not find any differences between the child and adult cohorts in pitch contour imitation, as they all preferred to and were able to imitate the functional goals (rising vs. falling). However, adults neglected form-related information in speech to a greater extent, resulting in poorer performance than children on exact matching of absolute and relative pitch. On the other hand, the results could also mean that children do not make as strong a distinction between speech and song as adults do. Studies have shown that, unlike musical communication, speech comprehension is remarkably robust to lack of detail in pitch variation (Patel, 2011; Patel et al., 2010). This is because the need for pitch precision in speech can be relaxed by integrating multiple context-based cues (including the voice onset time, vowel length, fundamental frequency, and first and second formant patterns) and knowledge sources (including semantics, syntax, and pragmatics) (Mattys et al., 2005). However, since the integrating abilities of children are not as mature as those of adults (McCreery & Stelmachowicz, 2011; Stelmachowicz et al., 2000), they may still mainly rely on pitch cues in speech imitation as they do in music imitation.
Generally speaking, we did not observe the developmental increase in imitative abilities that has been suggested by previous studies in speech (Kent & Forner, 1980; Loeb & Allen, 1993; Snow, 1998) and music (Cohrdes et al., 2016; Cooper, 1995; Geringer, 1983), except in duration variability. One explanation for this discrepancy may be related to differences in the age of the participants among the studies. The youngest child participant in the present study was 7.39 years old, whilst several previous studies examined development from 3 to 5 years (Loeb & Allen, 1993; Snow, 1998) or 5-7 years (Cohrdes et al., 2016). Thus, it is possible that the present task was too simple to reveal developmental change in participants beyond 7 years old, since 5-year-olds were already able to imitate falling versus rising contours (Loeb & Allen, 1993; Snow, 1998). In addition, the different grouping of age cohorts between studies might also account for the discrepancy in findings. Specifically, we grouped participants below 16 into the child cohort and those above 16 into the adult cohort, and age-related differences were then examined by comparing these two age cohorts. However, previous studies compared age-related differences at the year level (Cooper, 1995; Geringer, 1983; Kent & Forner, 1980; Loeb & Allen, 1993; Snow, 1998), for example, comparing 5-year-olds with 4-year-olds (Loeb & Allen, 1993; Snow, 1998). Thus, subtle developments over time may be masked in the present study, given the wide age range within each age cohort. Across our pre-defined age cohorts, however, there was no significant age × group interaction on any of the absolute or relative pitch or duration measures we examined. This suggests that age (≥16 or <16 years) influences speech and music imitation similarly across ASD and TD. 
Thus, our results on vocal imitation corroborate previous findings of persistent impairments in other areas of imitation across the lifespan in ASD (Biscaldi et al., 2014;Vivanti & Hamilton, 2014;Young et al., 2011).
Consistent with previous studies (Liu et al., 2013; Wisniewski et al., 2013), both the ASD and control groups imitated song more accurately than speech across all pitch-related measures. Several possibilities may explain this result. First, pitch imitation may be enhanced in song because, in order to achieve adequate communication, a higher degree of pitch precision is required for conveying musical meaning than speech meaning (Patel, 2008, 2011). Indeed, even individuals with congenital amusia imitated musical pitch better than linguistic pitch, since music is form-driven and speech is function-driven (Liu et al., 2013). Studies of intonation imitation among TD adults also suggest that English speakers tended to imitate the phonological structure (e.g., pitch accent, intonational phrase boundary), rather than the phonetic details (e.g., pause duration, irregular pitch periods), of intonation (Cole & Shattuck-Hufnagel, 2011). Thus, the worse pitch matching in speech imitation compared to song imitation may be because people tend to imitate the functional goal, rather than the exact form, of speech utterances (Liu et al., 2010). Second, one may argue that the slower tempo in songs might have positively affected pitch imitation, since singing accuracy improves considerably when people sing at slower as opposed to faster tempos (Dalla Bella et al., 2007). However, even when durations were equated across speech and song stimuli, the pitch imitation advantage for singing remained. Thus, the enhanced sung pitch imitation cannot simply be attributed to differences in the rate of speech versus song stimuli in the present study.
Interestingly, our results on duration matching across speech and song imitation indicate that the effect of domain is not equivalent across different duration measures. Whereas both groups achieved better absolute and relative duration matching in speech than in song, they made fewer time errors in song than in speech. The fewer time errors in song than in speech do not necessarily imply that duration matching during song imitation was superior to that during speech imitation for both groups, because the greater number of time errors in speech may reflect the higher temporal precision required in speech. Specifically, given that our time errors were defined as deviations from the target duration by more than 25% and that song stimuli contained longer target durations than speech stimuli (see Figure 1), the accuracy requirement was higher in absolute terms for speech than for song imitation. The results of more accurate imitation for both absolute and relative duration in speech than in song are consistent with previous findings (Albouy et al., 2020; Liu et al., 2013). Overall, both groups tended to be particularly sensitive to the appropriateness of duration in speech compared to that in song, and were more sensitive to pitch in song than in speech, suggesting that pitch imitation is independent from the imitation of duration (Dalla Bella et al., 2007; Drake & Palmer, 2000).
A caveat about the design of the current study concerns the stimuli that participants imitated. Taken from Mantell and Pfordresher (2013), the stimuli for the adult male and female versions were produced in American English with a midland dialect (male speaker) and an inland North dialect (female speaker). The child version was created by a child imitating the female model in British English. Since all our participants were British English speakers, one may wonder whether or to what extent the different dialects have affected imitative performance. 
However, research has shown that speakers can imitate detailed intonational patterns of a different variety of their language (D'Imperio et al., 2014). Similar to controls, individuals with ASD who have good language abilities can perceive acoustic differences in dialects as well as use these cues to group the dialects into the areas they come from (Clopper et al., 2012). Consistent with these findings, children, female and male adults with ASD and their TD counterparts in our study performed comparably in imitation of relative pitch and duration, suggesting that different dialects may not have affected imitation performance of our participants. In addition, impairments of absolute pitch and duration matching in ASD were observed not only in the adult cohort but also in the child cohort who imitated British English. Taken together, while the effect of dialect is unlikely to have influenced the current results, further studies are required to examine this specific hypothesis.

CONCLUSION
Using sentences and melodies that shared critical features, this study revealed for the first time that vocal imitative skills in individuals with ASD are impaired in absolute pitch and duration matching but intact in relative pitch and duration matching across speech and music domains. From children to adults, vocal imitation showed an improvement in the number of time errors across speech and song, but a decrease in pitch imitation accuracy in the speech condition only. These findings support the idea that speech and song imitation may involve shared cognitive and motor mechanisms, which may have implications for the development of language in individuals with ASD (Stone & Yoder, 2001).