What Characterizes Comprehensible and Native-like Pronunciation Among English-as-a-Second-Language Speakers? Meta-Analyses of Phonological, Rater, and Instructional Factors

The current study presents two meta-analyses to explore what underlies the assessment and teaching of comprehensible and nativelike pronunciation among English-as-a-Second-Language speakers. In Study 1, listener studies (n = 37) were retrieved examining the influence of segmental, prosodic, and temporal features on listeners' intuitive judgements of comprehensibility and nativelikeness/accentedness as per different listener backgrounds (expert, mixed, L2). In Study 2, training studies (n = 17) were retrieved examining the effects of segmental, prosodic, and temporal-based instruction on ESL learners' pronunciation. The results showed that (a) comprehensibility judgements were related to a range of segmental, prosodic, and temporal features; (b) accentedness judgements were strongly tied to participants' correct pronunciation of consonants and vowels; and (c) instruction led to larger gains in comprehensibility than in nativelikeness. Moderator analyses demonstrated that expert listeners were more reliant on phonological information. Greater effects of instruction on comprehensibility than nativelikeness became clearer, especially when the treatment targeted prosodic accuracy. The findings suggest that ESL practitioners should prioritize suprasegmental practice to help students achieve comprehensible L2 pronunciation. The attainment of nativelike pronunciation, by contrast, may require an exclusive focus on the refinement of segmental accuracy, which is resistant to the influence of instruction.

Attaining native-like pronunciation has long been considered a pedagogical priority in English-as-a-Second-Language (ESL) classrooms all over the world (e.g., Scales, Wennerstrom, Richard, & Wu, 2006). However, experts in the field of second language (L2) pronunciation have pointed out that the majority of adult L2 speech ends up foreign-accented (e.g., Flege, Munro, & MacKay, 1995). Because of this, the criteria underlying L2 pronunciation assessment and teaching should arguably prioritize communicative success over native-like production (e.g., Derwing & Munro, 2015). The current investigation reports on the results of two meta-analytic studies which examine the factors that underlie listeners' intuitive judgements of comprehensible versus native-like L2 English speech. The findings of these analyses are intended to not only inform L2 pronunciation pedagogy but also to draw tentative conclusions regarding our theoretical understanding of the complex relationship among speakers, listeners, L2 speech assessment, and acquisition.

BACKGROUND

Assessing Second Language Pronunciation
It is well documented that L2 pronunciation is coloured by phonological and phonetic features found in the first language (L1), especially when learning begins after puberty (Flege et al., 1995). Much of the material for teaching and learning pronunciation is driven by a nativelikeness orientation (e.g., Foote et al., 2011 for ESL in Canada), which seeks to reduce or eliminate L1 accent from L2 speech (Tokumoto & Shibata, 2011). The attainment of native-like pronunciation, however, may be limited to individuals with specific cognitive-perceptual abilities, such as phonemic coding (Hu et al., 2013), associative memory (Silbert et al., 2015), and precise auditory processing and acuity (Saito, Kachlicka, Sun, & Tierney, 2020). Furthermore, very few learners are able to reach native-like pronunciation norms, and they may only be able to do so if their L1 is linguistically close to the target language (e.g., Dutch learners of English; Bongaerts, van Summeren, Planken, & Schils, 1997; see also Saito, Macmillan, et al., 2020 for Indo-European vs. non-Indo-European speakers of L2 English).
In light of these findings, it is important that language teachers are made aware that attaining native-like L2 pronunciation is a difficult task, even if it is an idealized goal. It may also be an unnecessary one, however, considering that much English-medium communication takes place between L2 users themselves. In this setting, foreign accent is a normal and expected characteristic of L2 speech (Pennycook, 2017). On the basis of this argument, a number of scholars have emphasized the importance of attaining the more realistic and achievable goals of comprehensibility, intelligibility, and communicative adequacy, as these are what ultimately matter for successful L2 communication (Levis, 2018).

Comprehensibility and Accentedness
The concepts, methodologies, and operationalizations of "realistic" pronunciation goals have varied widely across primary studies (e.g., see Munro & Derwing, 2011 for a list of different outcome measures for "intelligibility"; for further discussion, see the Future Directions section in this paper). The current study focuses on two global constructs of L2 pronunciation proficiency: comprehensibility (i.e., ease of understanding) and accentedness (i.e., phonological nativelikeness; for more detailed discussion on different dimensions of L2 pronunciation proficiency, see Saito & Plonsky, 2019). Since Derwing and Munro's seminal work (e.g., Derwing & Munro, 1997; Munro & Derwing, 1995), much attention has been given to contrasting L2 comprehensibility and accentedness. From a methodological perspective, both constructs are measured in the same way, by tapping into listeners' intuitive judgements of L2 speech. Upon hearing a sample of speech, raters use a 9-point scale to evaluate how comprehensible (1 = difficult to understand, 9 = easy to understand) and accented (1 = heavily accented, 9 = no accent) that sample was. In other L2 speech assessment studies, accentedness has also been operationalized as "global foreign accent" (e.g., Riney & Takagi, 1999) and "perceived nativelikeness" (Abrahamsson & Hyltenstam, 2009). In essence, these terms (accentedness, global foreign accent, and nativelikeness) refer to a conceptually similar phenomenon: how closely L2 speech approximates the phonological norm of native speakers. However, the concept of accentedness stands in sharp contrast with that of comprehensibility, which is assumed to index listeners' effort, and by extension ease, of understanding.
This intuitive approach to assessing comprehensible and native-like L2 pronunciation has strong ecological validity, as it is assumed to reflect the instant and impressionistic judgements made by interlocutors during oral communication in real-life contexts (whether communication takes place between L1 and L2 speakers or between L2 and L2 speakers). It also differs from expert assessment, where professional coders are trained to determine global proficiency in accordance with detailed rubrics (see Isaacs, Trofimovich, Yu, & Muñoz Chereau, 2015 for a discussion of the relationship between comprehensibility and L2 pronunciation proficiency in IELTS).
Findings from several studies conducted by Derwing and Munro have shown that it is common for speech to be judged as highly accented yet remain comprehensible, suggesting that comprehensibility and accentedness are essentially different phenomena (Derwing & Munro, 1997; Munro & Derwing, 1995). Examining what characterizes and distinguishes comprehensibility and accentedness is a crucial initiative with a number of practical implications. Thus far, many scholars have examined which phonological features are relatively important (or irrelevant) to listener judgements of L2 comprehensibility and accentedness. Such studies enable practitioners to identify discrete sets of pronunciation features that students could practice as a priority in order to improve their global L2 pronunciation proficiency in accordance with their goals (i.e., enhancing comprehensibility vs. nativelikeness; Trofimovich & Isaacs, 2012) and intended interlocutors (e.g., L1 vs. L2 listeners; Saito, Tran et al., 2019).
From a theoretical standpoint, comprehensibility (rather than nativelikeness) is crucial for measuring adult L2 speech development. The Interaction Hypothesis (Gass, 1997; Long, 1996; Mackey, 2012), for instance, posits that language learning takes place precisely when input is made comprehensible during conversational interaction between speakers. A great deal of attention has been directed towards investigating L2 speakers' interlanguage development and ultimate attainment of L2 comprehensibility and accentedness in both naturalistic (e.g., Derwing & Munro, 2013; Saito, 2015) and classroom settings (e.g., Nagle, 2018; Saito & Hanzawa, 2018). These studies generally agree that comprehensibility can continue to improve irrespective of the degree of foreign accentedness as long as the L2 is used regularly on a daily basis (for a similar theoretical account of adult L2 speech learning, see Flege & Bohn, 2020). By contrast, native-like L2 pronunciation is rare among post-pubertal learners and is determined by factors related to learners' special talent rather than the length, quality, and timing of L2 use (e.g., Saito, Kachlicka, et al., 2020 for auditory sensitivity). Therefore, examining the phonological correlates of comprehensibility and accentedness is an important step towards shedding light on the driving force of two major dimensions of L2 speech learning.
Studies adopting a longitudinal perspective have also examined the relative importance of specific phonological features in L2 comprehensibility and accentedness by adopting a pre- and post-test design. These studies typically deliver, analyse, and compare the efficacy of different types of instruction that are related to different features of interest (e.g., segmentals vs. suprasegmentals). Results of primary studies have thus far shown that teaching certain features can make a perceptible impact on comprehensibility and accentedness. Features examined so far include English-specific segmentals (Saito, 2011; Wisniewska & Mora, 2020); word/sentence stress and intonation (Gluhareva & Prieto, 2017); and speech clarity, fluidity, and smoothness (Hamada, 2018 for shadowing; Galante & Thomson, 2017 for drama-based techniques; Tran & Saito, in press for 4/3/2 activity).

Listener Factors
In line with Derwing and Munro's framework, comprehensibility and accentedness can be affected by factors related not only to speakers but also to listeners. In other words, even if two listeners assess the same speech stimuli, their ratings may differ to some degree due to, for example, the quantity and quality of their experience with L2 speech assessment. As reviewed above, much discussion has revolved around how L2 speakers can improve the segmental and suprasegmental qualities of their speech (speaker → listener comprehensibility/accentedness). Though fewer in number, some empirical studies have begun to illustrate how listeners' backgrounds influence their comprehensibility judgements, and how they can adjust the strategies used when listening to accented speech (listener → speaker comprehensibility/accentedness) (for an overview, see Derwing & Munro, 2015).
A critical line of research has attempted to identify the factors that influence listeners' judgements of L2 comprehensibility and accentedness. For example, it has been shown that listeners tend to assign more lenient scores when they are familiar with particular foreign accents (Kennedy & Trofimovich, 2008) and topics (Gass & Varonis, 1984), have a linguistics and/or teaching background (Saito, Trofimovich, Isaacs, & Webb, 2016; Saito, Trofimovich, & Isaacs, 2017), and have multilingual experience (Saito & Shintani, 2016; Shintani, Saito, & Koizumi, 2019). A subset of studies has also highlighted the differences and similarities between L1 and L2 listeners' L2 comprehensibility judgements (e.g., Foote & Trofimovich, 2018; Saito, Tran et al., 2019). It is noteworthy, however, that other studies have failed to find significant effects of listener backgrounds in L2 comprehensibility and accentedness judgements (e.g., Isaacs & Thomson, 2013).

Reliability Coefficients
Another factor that remains relatively unexplored is the reliability of L2 comprehensibility and accentedness judgements. In high-stakes speaking assessments, professional raters receive hours of special training to rate speech using holistic rubrics. These raters practice and reach agreement rates that typically range between 0.70 and 0.80 (e.g., see Chen et al., 2018 for TOEFL speaking tasks). It is noteworthy that comprehensibility and accentedness judgements are made intuitively by listeners without any training or rating descriptors. Thus, one obvious question concerns whether the judgements of untrained listeners can reach a comparable agreement rate, and whether the strength of agreement varies according to listener backgrounds. Although there is some qualitative research which hints that experienced listeners likely have a clearer understanding of their assessment processes than naïve listeners (Isaacs & Thomson, 2013), to my knowledge, no synthesis of the research has included degree of consistency as a variable of interest.
According to Plonsky and Derrick (2016), the lack of research on the development, discussion, and provision of guidelines on the reliability of comprehensibility and accentedness judgements is problematic for future research, since it considerably clouds the interpretability of study findings (i.e., whether to ascribe any part of the results to the variables of concern or to unreliable outcome measures). Furthermore, the comparability of studies remains unclear even if the same methods have been used, which, in turn, negatively impacts the construct validity of meta-analyses on this topic. In the field of applied linguistics, some scholars have proposed rough estimates of acceptable rates of consistency (e.g., α > .70 as "moderate" to "substantial"; Brown, 2014). Plonsky and Derrick (2016) surveyed different types of reliability estimates reported in 535 primary studies, finding that the benchmark of satisfactory inter-rater reliability could be relatively high (α = .92, interquartile range = .13).

MOTIVATION FOR CURRENT STUDY
Over the past 15 years, the distinction between comprehensible and native-like L2 pronunciation has attracted a great deal of attention from researchers and practitioners alike. To further examine precisely what distinguishes between comprehensibility and accentedness in L2 pronunciation assessment and teaching, there are several topics worthy of further investigation which could have important implications for ESL practitioners all over the world. To synthesize what underlies L2 comprehensibility and accentedness judgements, I will present the results of a meta-analysis on the existing literature. In particular, the paper highlights how L2 comprehensibility and accentedness are differentially tied to (a) speaker factors (i.e., which speech properties affect judgements) and (b) listener factors (i.e., how listener backgrounds influence judgements) (Derwing & Munro, 2015). These are taken up as the main issues in the current study.

Speaker Factors
The first topic relates to the process of L2 comprehensibility and accentedness judgements (i.e., what phonological dimensions underlie listeners' instant, intuitive, and scalar judgements of comprehensibility and accentedness). A clearer understanding of these dimensions could lead to numerous practical implications. For assessment, the findings could reveal whether listener behaviours actually differ between the supposedly distinguishable constructs of L2 speech assessment (i.e., perceiving ease of understanding and nativelikeness). For teaching, the findings could directly inform practitioners about which pronunciation features make the most impact on comprehensibility and accentedness, and how learners can be encouraged to achieve two different goals of L2 pronunciation learning in an efficient and effective manner (enhancing comprehensibility vs. reducing foreign accentedness).
As reviewed earlier, a number of primary studies have focused on a range of pronunciation features that significantly affect L2 comprehensibility and accentedness (for a narrative review, see Munro, Derwing, & Thomson, 2015). Accordingly, it is unsurprising that studies directly comparing the phonological correlates of comprehensibility and accentedness within a single study have led to different observations. For example, although Munro and Derwing (1995) found that segmental accuracy was a primary determinant of accentedness, Trofimovich and Isaacs (2012) found that prosodic accuracy accounted for the largest amount of variance in both comprehensibility and accentedness.
Intervention studies have produced similarly mixed findings. For example, some studies have demonstrated that both suprasegmental- and segmental-based instruction affect comprehensibility and accentedness, especially when effectiveness is tested via controlled tasks (e.g., Zhang & Yuan, 2020). Other studies, however, suggest that segmental-based instruction is facilitative of comprehensibility but not accentedness (e.g., Saito, 2011), and that suprasegmental-based instruction likely leads to more gains in comprehensibility (e.g., Gordon & Darcy, 2016). These studies have yet to reach a consensus on which pronunciation features actually matter for the assessment and training of L2 comprehensibility and accentedness.

Listener Factors
A second topic worth clarifying is whether the phonological correlates of comprehensibility and accentedness are subject to the influence of listener background. The aforementioned literature review has brought to light the lack of agreement on this topic. In spite of the supporting evidence, some studies have indicated that the role of listener background may be minor (Munro, Derwing, & Morton, 2006) and/or non-quantifiable (Isaacs & Thomson, 2013). In line with Plonsky and Derrick's (2016) call for the further examination of reliability estimates in applied linguistics research, it is crucial to take a first step towards surveying the inter-rater reliability of listeners' intuitive judgements in accordance with their background.

Previous Meta-Analyses
To my knowledge, there are two published meta-analyses concerning L2 pronunciation teaching, that is, Lee, Jang, and Plonsky (2015) and Saito and Plonsky (2019).[2] These projects examined how instruction could be facilitative of L2 pronunciation development on a broader level. L2 pronunciation proficiency was conceptualized/operationalized in many different ways, such as overall impressions (comprehensibility, accentedness, and fluency), segmental accuracy (the correct pronunciation of consonants and vowels), prosodic accuracy (the lack and misplacement of word and sentence stress), and fluency (speech rate, pause ratio, repair, and self-repetition ratio). The instruction variable was treated as a monolithic construct without any mention of instructional focus (e.g., comprehensibility vs. accentedness; segmentals vs. prosody vs. fluency). More importantly, none of the meta-analyses analysed L2 pronunciation assessment; the current study took a first step towards detangling the relationship among constructs (comprehensibility and accentedness), speech properties (phonological accuracy and fluency), and rater backgrounds (expert vs. novice).

Footnote 2. During the final publication process of the current manuscript (March 2021), two similar meta-analysis projects were identified to be either published (Suzuki, Kormos, & Uchihara, 2021) or ongoing (Crowther, forthcoming). Using different screening criteria (including diverse L1-L2 pairings), these projects have looked at different dimensions of global L2 pronunciation proficiency. Whereas Suzuki et al.'s focus lies in the acoustic correlates of perceived fluency (rather than accuracy), Crowther's report aims to provide a comprehensive analysis of comprehensibility, accentedness, and intelligibility. Here I would like to claim that the topic (i.e., what matters for listeners' intuitive reactions to foreign accented speech) will continue to grow as an important research agenda in the field, given that the findings of the meta-analyses (including mine) will help us design and carry out future studies with more rigorous methodologies.

Research Questions
The mixed findings on these two important topics in L2 speech research (i.e., speaker and listener factors in L2 comprehensibility and accentedness) could be ascribed to a range of methodological differences (e.g., speakers, elicitation methods, listeners, contexts). By synthesizing the outcomes of primary studies via a meta-analytic approach, the current study aims to provide a more comprehensive picture of the mechanisms underlying listeners' judgements of L2 speech. The following three research questions were formulated:
1. What is the observed inter-rater reliability of intuitive L2 comprehensibility and accentedness judgements?
2. Which pronunciation features do listeners use during their judgements of comprehensible and native-like pronunciation?
3. How does listener background influence the strength of agreement, and the relative weight of segmentals, prosody, and fluency in L2 comprehensibility and accentedness judgements?
In order to detangle the multilayered links among speakers, listeners, and L2 judgements, two different meta-analyses are conducted to approach this topic from two different angles. Study 1 focuses on which pronunciation features (segmentals, prosody, and fluency) listeners attend to while assessing the comprehensibility and accentedness of ESL speech (n = 37 listener studies). Study 2 focuses on the extent to which different types of instruction (segmental, prosodic, and temporal practice) can impact on L2 comprehensibility and accentedness in the most effective and efficient way (n = 17 training studies).

STUDY 1: LISTENER RESEARCH

Study Retrieval and Screening
Focused and Narrow Approach. Following Plonsky and Brown's (2015) emphasis on the importance of defining a meta-analytic domain of interest, the scope of the search was carefully determined in conjunction with the objectives of the study. Although the previously published meta-analyses explored L2 pronunciation teaching (Lee et al., 2015;Saito & Plonsky, 2019), the current project concerns the assessment of L2 pronunciation proficiency. The search specifically focuses on the pronunciation factors that affect listeners' intuitive evaluations of the comprehensibility and accentedness of L2 English speech. This focus was adopted in order to provide pedagogical implications tailored to ESL practitioners in particular (teachers, students, and assessors). More importantly, there is evidence that the relative importance of L2 comprehensibility and accentedness greatly varies in accordance with different L1-L2 pairings, resulting in different phonetic features and interlanguage issues (Idemaru, Wei, & Gubbins, 2019). Following Plonsky and Brown's (2015) conceptual framework, the current study could be considered a focused meta-analysis in that it only included those studies directly relevant to the aims and context of the study.

Inclusion and Exclusion Criteria
The search procedures and inclusion/exclusion criteria used in the current study were inclusive in nature, featuring a wide range of publication sources, such as papers published in peer-reviewed journals, book chapters, research reports, conference proceedings, and PhD dissertations.
First, the literature search was conducted using a range of tools and sources. These included reference sections of primary studies and online search engines. The search engines were linked to six major library databases (Educational Resources Information Center, Linguistics and Language Behavior Abstracts, PsycINFO, PsycArticles, Web of Science, and ProQuest Dissertations) and one online resource (Google and Google Scholar).
Keywords for the search included accentedness, assessment, comprehensibility, fluency, foreign accent, intelligibility, nativelikeness, oral proficiency, pronunciation, raters, speaking, and speech. The publication year of Munro and Derwing's (1995) original paper was set as the starting point, and February 2020 as the final cut-off point.
Following the notion of an inclusive approach, ancestry searches were conducted on a range of peer-reviewed journals (e.g., Applied Linguistics, Applied Psycholinguistics; Bilingualism: Language and Cognition; Journal of Second Language Pronunciation; Language Learning; Language Teaching Research; Modern Language Journal; and TESOL Quarterly); key edited volumes, such as the handbook of English pronunciation (Reed & Levis, 2015), the Routledge handbook of contemporary English pronunciation (Kang, Thomson, & Murphy, 2017), and second language pronunciation assessment (Trofimovich & Isaacs, 2017); major conference proceedings (e.g., International Congress of Phonetic Sciences; Pronunciation in Second Language Learning and Teaching; New Sounds); education reports (e.g., IELTS Research Report Series); and PhD dissertations (included in ProQuest Dissertations).
Second, the decision was made to include only studies examining segmental, prosodic, and temporal influences on comprehensibility and accentedness judgements (the main objective of the study). Thus, studies examining listener behaviour only (e.g., Trofimovich & Isaacs, 2011) or the lexicogrammar correlates of comprehensibility and accentedness judgements (e.g., Ruivivar & Collins, 2018) were excluded.

Coding
To examine which pronunciation features listeners attuned to during their L2 comprehensibility and accentedness judgements, the predictor measures used in the primary studies were coded for three dimensions in accordance with Saito and Plonsky's (2019) framework of L2 pronunciation proficiency: (a) pronouncing consonants and vowels correctly (i.e., segmental accuracy), (b) assigning adequate stress at the word and sentence levels (i.e., prosodic accuracy), and (c) delivering speech at an optimal tempo (i.e., temporal fluency). The corresponding segmental, prosodic, and temporal analysis measures are summarized in Supporting Information.
Given that listener background plays a key role in judgements of L2 comprehensibility and accentedness (Isaacs & Thomson, 2013), the listeners featured in each primary study were coded as either expert or novice. In line with Yan and Ginther (2017), expert listeners were defined as those with previous academic knowledge of linguistics and/or L2 teaching experience, whereas novice listeners were defined as those without such relevant experience. Given the nature of the dataset, two coding groups emerged: expert listeners (n = 12 studies) and mixed listeners (n = 21 studies). The latter category was applied to studies that included both expert and novice listeners. In addition, given that most communication in English takes place between L2 users, another category was added: L2 listeners (n = 5 studies). Since there were only two studies examining L2 users' accentedness ratings (Crowther et al., 2017; del Río San Román, 2013), the moderator analyses were restricted to the dimension of L2 comprehensibility ratings.
Finally, all reliability estimates were included where available. In line with Plonsky and Derrick's (2016) suggested methodology, all relevant information including intraclass correlations and Cronbach's alpha was combined and aggregated. This was because the purpose of the reliability meta-analyses was to provide a rough estimate of the distribution and variability of reliability coefficients. To examine the effects of listener backgrounds, inter-rater reliability was calculated for each group of listeners: (a) expert, (b) L2 listeners, and (c) mixed. In all, 37 studies were initially coded by the author. To check the reliability of the coding, a PhD student in applied linguistics was trained using the coding scheme and separately coded approximately 50% of the data (n = 20 studies). There were no coding discrepancies. Thus, the author completed the coding of the remaining studies alone (n = 17 studies).

Results
Average Reliability. Following the analytic procedure recommended in Plonsky and Derrick (2016), the coefficients of reliability (intraclass correlations, Cronbach's alpha) were aggregated across a total of 31 empirical studies in which researchers reported the reliability of listeners' comprehensibility and accentedness judgements. As visually plotted in Figure 1, the average inter-rater agreement was relatively high for both comprehensibility and accentedness (M coefficients = .896 for comprehensibility and .909 for accentedness). Descriptive statistics (summarized in Table 1) indicated substantial overlaps in 95% confidence intervals (CI) across the rating dimensions (comprehensibility vs. accentedness) and listener backgrounds, suggesting that listeners generally had strong agreement about the definitions of each dimension.
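As a rough illustration of how such coefficients arise and can then be summarized, the following Python sketch computes Cronbach's alpha for a small ratings matrix (treating listeners as "items") and averages a list of study-level coefficients with a normal-approximation 95% CI. The ratings, the coefficient values, and the function names are hypothetical and serve illustration only; the primary studies and the procedure in Plonsky and Derrick (2016) may differ in detail.

```python
import math
from statistics import mean, stdev, variance


def cronbach_alpha(ratings):
    """Cronbach's alpha with listeners treated as items.

    ratings: list of rows, one row per speech sample,
             each row holding one score per listener.
    """
    k = len(ratings[0])  # number of listeners
    if k < 2 or len(ratings) < 2:
        raise ValueError("need at least 2 listeners and 2 speech samples")
    # Sample variance of each listener's scores across speech samples
    listener_vars = [variance([row[j] for row in ratings]) for j in range(k)]
    # Sample variance of the summed score per speech sample
    total_var = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum(listener_vars) / total_var)


def mean_with_ci(coefficients, z_crit=1.96):
    """Mean of study-level coefficients with a normal-approximation 95% CI."""
    m = mean(coefficients)
    se = stdev(coefficients) / math.sqrt(len(coefficients))
    return m, (m - z_crit * se, m + z_crit * se)


# Hypothetical 9-point ratings: 5 speech samples x 3 listeners
example_ratings = [
    [2, 3, 2],
    [4, 4, 5],
    [5, 6, 5],
    [7, 7, 8],
    [9, 8, 9],
]
alpha = cronbach_alpha(example_ratings)

# Hypothetical reliability coefficients, standing in for four studies
avg, ci = mean_with_ci([0.88, 0.90, 0.92, 0.90])
print(f"alpha = {alpha:.3f}; mean coefficient = {avg:.3f}, 95% CI = {ci}")
```

The simple unweighted mean is a design choice made for brevity; a synthesis could equally weight coefficients by the number of listeners or speech samples.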
Effect Size Aggregation. Effect sizes (inverse-variance weighted mean correlations) for the second research question were aggregated using the metafor package (Viechtbauer, 2010) in the R statistical environment (R Core Team, 2018). A random-effects model that included the main moderator variable was used. For each primary study, the associations between L2 global ratings (comprehensibility and accentedness) and predictor phonological variables (segmentals, prosody, and fluency) were converted using Fisher's z-transformation, $z = 0.5 \ln\left(\frac{1 + r}{1 - r}\right)$. Absolute values of the effect sizes were used given the different rating scales used in the studies (1 = not comprehensible, 9 = comprehensible, or vice versa) and the directionality (e.g., positive or negative) of effects. The z-scores were subsequently transformed back into r values to present the results. Strength of effect size was interpreted using Plonsky and Oswald's (2014) field-specific benchmarks (r = .25, .40, and .60 for small, medium, and large effects, respectively). A within-group Q value ($Q_w$) was used as a measure of homogeneity for each group's effect sizes (to check for significant variation in true effect sizes across studies).
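The aggregation steps described here can be sketched in a few lines of Python. This is a minimal illustration and not the metafor analysis itself: it assumes the usual sampling variance of 1/(n - 3) for Fisher's z, swaps in the DerSimonian-Laird estimator for the between-study variance (which may not be the estimator used in the study), and omits the moderator. The correlations, sample sizes, and function names are hypothetical.

```python
import math


def fisher_z(r):
    """Fisher's z-transformation of a correlation."""
    return 0.5 * math.log((1 + r) / (1 - r))


def inverse_fisher_z(z):
    """Back-transform a Fisher z value into a correlation."""
    return math.tanh(z)


def pool_correlations(rs, ns):
    """Random-effects pooled correlation from absolute r values.

    rs: correlations from primary studies (sign is discarded)
    ns: sample size behind each correlation (each must exceed 3)
    """
    zs = [fisher_z(abs(r)) for r in rs]
    vs = [1.0 / (n - 3) for n in ns]      # sampling variance of z
    w = [1.0 / v for v in vs]             # inverse-variance weights
    z_fixed = sum(wi * zi for wi, zi in zip(w, zs)) / sum(w)
    # Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, zs))
    df = len(zs) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # DerSimonian-Laird between-study variance (an assumption of this sketch)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in vs]
    z_re = sum(wi * zi for wi, zi in zip(w_re, zs)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    ci = (inverse_fisher_z(z_re - 1.96 * se), inverse_fisher_z(z_re + 1.96 * se))
    return {"r": inverse_fisher_z(z_re), "ci": ci, "Q": q, "tau2": tau2}


def benchmark(r):
    """Label |r| using the .25/.40/.60 benchmarks cited in the text."""
    r = abs(r)
    if r >= 0.60:
        return "large"
    if r >= 0.40:
        return "medium"
    if r >= 0.25:
        return "small"
    return "below small"  # label chosen for this sketch


# Hypothetical correlations and sample sizes from four studies
result = pool_correlations([0.62, -0.48, 0.71, 0.55], [40, 60, 35, 80])
print(result, benchmark(result["r"]))
```

Taking absolute values before transformation mirrors the procedure described above; it means the pooled estimate reflects strength of association only, not direction.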
To calculate the phonological correlates of L2 comprehensibility and accentedness judgements, a total of 406 effect sizes (from 27 individual studies) were aggregated to produce a weighted mean effect size and 95% CI. A total of n = 274 effect sizes were obtained for comprehensibility from 20 studies, whereas n = 132 were obtained for accentedness from 14 studies. In line with similar meta-analysis projects on L2 pronunciation (Lee et al., 2015; Saito & Plonsky, 2019), an inclusive approach was adopted, allowing one primary study to contribute multiple effect sizes (M = 15.0 effect sizes per study, range = 4-25). Given that the studies operationalized the constructs of pronunciation in multiple ways (e.g., fluency as listeners' judgements vs. acoustic analyses of speech rate, pause ratio, and repetition frequency), the decision was made to include as many raw measures as possible (instead of aggregating them). As such, the current meta-analysis was assumed to capture a wide range of methodological variation in primary studies.[4] As summarized in Table 2, the overall relationship between the phonological predictors and global judgements was moderate-to-strong for both comprehensibility (r = .580, 95% CI [.546, .611], z = 13.427, p < .001) and accentedness (r = .589, 95% CI [.552, .624], z = 10.262, p < .001) (see also Figure 2). As indicated by the overlapping 95% CI values, the listeners' judgements of comprehensibility and accentedness were predicted by phonological accuracy information in particular [...]

Footnote 4. It needs to be acknowledged that the inclusive approach may leave the outcomes subject to the influence of a single study. This could be problematic, especially when primary studies feature outliers. According to Tables 3 and 4, CI values did not show much variation (e.g., 0.1-0.2). In addition, the results of Grubbs' tests failed to find any significant outliers in any contexts (p > .05). The results suggest that the dataset represents well how a range of phonological measures are related to L2 comprehensibility and accentedness among the 27 primary studies.
In terms of the phonological correlates of the judgements, between-group Q values were calculated to see whether the strength of the correlation coefficients differed according to the three dimensions (segmentals, prosody, and fluency). Statistical significance was reached for accentedness, Q(2) = 47.222, p < .001, but not for comprehensibility, Q(2) = 3.975, p = .137. As summarized in Table 2, and visually plotted in Figure 3, no significant difference (clear overlaps of 95% CI values) was found among the roles of segmentals, prosody, and fluency in L2 comprehensibility judgements. In contrast, the correlations between accentedness and the three dimensions of L2 speech (segmentals, prosody, and fluency) were clearly distinguishable in strength at a p < .05 level. While judging accentedness, listeners appear to prioritize the three dimensions in the following order: segmental accuracy (r = .792, 95% CI [...]
Listener Backgrounds. The final objective of the statistical analyses was to examine how the moderator variable (i.e., listener background) affected the influence of the phonological features on L2 comprehensibility and accentedness ratings. Descriptive results of the effect sizes according to listener type are summarized in Table 3 and visually plotted in Figure 4. According to the results of the Q tests, the strength of the overall correlations between global and specific pronunciation features differed significantly for comprehensibility among the three groups of listeners (i.e., expert vs. mixed vs. L2 listeners), Q(2) = 25.672, p < .001. The 95% CIs showed that the degree of dependence on phonological information during L2 comprehensibility judgements varied in the following order: experts (r = [...]. Due to the lack of primary studies (n = 2), moderator analysis was not performed for L2 listeners for accentedness. Lastly, the role of listener background in L2 comprehensibility and accentedness judgements was analysed.
As visually plotted in Figure 5, the results showed that for comprehensibility judgements, the listener factor did not seem to impact rating behaviours. All listeners drew equally on the dimensions of accuracy (segmental and prosodic) and fluency, Q(2) = 2.612, p = .271 for expert listeners, Q(2) = 2.006, p = .366 for L2 listeners, and Q(2) = 0.035, p = .982 for mixed listeners.
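For readers who wish to see how a between-group Q value relates to its p value, the following Python sketch may help. It assumes the usual formulation in which Q is the weighted sum of squared deviations of subgroup estimates from their pooled mean and is referred to a chi-square distribution with (number of groups − 1) degrees of freedom. The function names and the subgroup values are hypothetical, and the closed-form tail probability used here covers even degrees of freedom only, which is sufficient for three-group comparisons (df = 2).

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate for positive even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form covers positive even df only")
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def q_between(estimates, variances):
    """Between-group Q and its df from subgroup estimates and their variances."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    return q, len(estimates) - 1

# Q values reported in the text for the comprehensibility analyses (df = 2).
print(round(chi2_sf_even_df(2.612, 2), 3))  # 0.271
print(round(chi2_sf_even_df(3.975, 2), 3))  # 0.137

# Hypothetical subgroup estimates (z scale) and sampling variances.
q, df = q_between([0.90, 0.66, 0.61], [0.004, 0.006, 0.010])
print(round(q, 3), df, round(chi2_sf_even_df(q, df), 4))
```

The first two printed values match the p values reported for Q(2) = 2.612 and Q(2) = 3.975, which suggests that the reported tests follow this chi-square logic.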
Listener background played some role in the relationship between phonological factors and the evaluations of L2 accentedness. Overall, both expert and mixed listeners used segmental accuracy as a primary factor during their judgements, Q(2) = 34.681, p < .001 for expert listeners, and Q(2) = 22.998, p < .001 for mixed listeners. Taking [...]

STUDY 2: INTERVENTION RESEARCH
The results of Study 1 suggest that (a) L2 comprehensibility is linked to various phonological features; (b) L2 accentedness is tied to segmental accuracy; and (c) expert listeners rely more on segmental information, in their judgements of accentedness in particular. Given that these suggestions were based on cross-sectional research, any discussion of causal relationships needs to be made with caution. To take a different look at the phonological characteristics of L2 comprehensibility and accentedness, scholars have also conducted training studies with a pre- and post-test design. These studies longitudinally examine how ESL students develop their pronunciation ability over time when receiving different forms of explicit pronunciation instruction (e.g., segmental-, prosody-, and fluency-based training). Study 2 was designed to meta-analyse the published intervention research so as to shed light on which pronunciation features most impact the development of L2 comprehensibility and accentedness. Lee et al. (2015) and Saito and Plonsky (2019) demonstrated that instruction could facilitate L2 pronunciation learning with small-to-medium effects (d = 0.80, 0.73). However, neither study examined the extent to which such instructional effectiveness differs when learning gains are assessed for comprehensibility vs. nativelikeness, or the extent to which type of instruction (segmental-, prosody-, vs. fluency-based training) could maximize the development of comprehensible vs. native-like L2 pronunciation proficiency. The current meta-analysis (Study 2) was designed to address these concerns.

Study Retrieval and Inclusion and Exclusion Criteria
The same search procedures used in Study 1 were adopted here with the addition of the following key words: instruction, teaching, pre- and post-test, training, and intervention. The following inclusion criteria, which derived from Study 1, were used:
• A wide range of publications (journal articles, book chapters, conference proceedings, and PhD dissertations) were included.
• Comprehensibility and accentedness were used as outcome measures.
• Participants comprised ESL students.
Given that Study 2 relates to intervention studies, the following two new inclusion criteria were also added:
• Explicit pronunciation instruction was provided (for the definition see below).
• Instructional gains (comprehensibility and accentedness) were measured via a pre- and post-test design.
As in Study 1, the date range was between 1995 and February 2020. The final dataset comprised 17 intervention studies involving 290 students (see Supporting Information). These studies also provided the necessary statistical information for the calculation of Cohen's d, that is, means, standard deviations, standard errors, and/or t values. For each study, effect sizes were calculated to index the extent to which participants improved in terms of the comprehensibility and/or nativelikeness of their L2 speech over time (i.e., pre- to post-tests; within-group contrasts). Because very few studies included a control group that did not receive any pronunciation instruction (n = 5 out of 17), effect sizes for the between-group contrasts (experimental vs. control) were not calculated. A total of 16 peer-reviewed journal articles and one PhD dissertation were included.
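How a within-group Cohen's d might be derived from the kinds of statistics listed above can be sketched as follows. The source does not specify which formula was applied, so everything here is an assumption: the function names are hypothetical, the first function standardizes the gain by the average of the pre- and post-test variances, and the last uses the paired-samples conversion d = t / √n, which yields a slightly different flavour of d.

```python
import math

def cohens_d_from_means(mean_pre, sd_pre, mean_post, sd_post):
    """Pre-to-post gain standardized by the average of the two variances."""
    sd_pooled = math.sqrt((sd_pre ** 2 + sd_post ** 2) / 2)
    return (mean_post - mean_pre) / sd_pooled

def sd_from_se(se, n):
    """Recover a standard deviation when only the standard error is reported."""
    return se * math.sqrt(n)

def cohens_d_from_paired_t(t, n):
    """Paired-samples conversion (d_z): t divided by the square root of n."""
    return t / math.sqrt(n)

# Hypothetical comprehensibility ratings on a 9-point scale (higher = better).
d_gain = cohens_d_from_means(mean_pre=4.8, sd_pre=1.2, mean_post=5.6, sd_post=1.1)
print(round(d_gain, 2))
```

Because the two conversions standardize by different quantities, a meta-analysis would normally apply one of them consistently or report which was used for each study.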

Coding
Type of pronunciation instruction was categorized into (a) segmental training, (b) prosody training, and (c) fluency training. Segmental training referred to the provision of explicit instruction on articulatory and perceptual characteristics of vowels and consonants that ESL learners likely have difficulty with (e.g., Saito, 2011 for English [æ, θ, ð, w, l, ɹ] for Japanese ESL students). Prosody training referred to the provision of explicit instruction on lexical and sentence stress and intonation (Levis & Levis, 2016 for the use of picture-prompted comparison for sentence stress). Fluency training referred to encouraging the memorization, repetition, and/or reading aloud of already scripted sentences. The focus of fluency training lies in the clarity, fluidity, and smoothness of speech delivery rather than segmental and prosodic accuracy. Examples of this kind of training included listening to and repeating podcast speech (e.g., Foote & McDonough, 2018 for shadowing) and acting as an imaginary character by reciting scripted lines (e.g., Galante & Thomson, 2017 for drama-based techniques). The author and the same linguistically trained coder as in Study 1 separately read the 17 intervention studies and coded the primary focus of pronunciation instruction (segmentals vs. prosody vs. fluency). There were no discrepancies in coding.

Results
Effect Size Aggregation. A total of 17 intervention studies produced 28 effect sizes (i.e., Cohen's d) for comprehensibility and 20 effect sizes for accentedness. As in Study 1, all effect sizes were submitted to a random effects model using the metafor package (Viechtbauer, 2010) in the R statistical environment (R Core Team, 2018). All indices were interpreted with reference to Plonsky and Oswald's (2014) benchmarks (d = 0.6, 1.0, and 1.4 for small, medium, and large effects, respectively). As summarized in Table 4, and visually plotted in Figure 6, pronunciation training significantly enhanced participants' comprehensibility and accentedness with small effects (d = 0.610, 0.278), as the 95% CIs did not include zero ([0.479, 0.740] for comprehensibility; [0.115, 0.440] for accentedness). According to the results of a between-group Q test (Qb), the difference between comprehensibility and accentedness was statistically significant, Q(1) = 4.243, p = .039. This indicates that the impact of instruction was larger for comprehensibility than accentedness.
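The logic of pooling effect sizes under a random effects model can be illustrated with a short Python sketch. Note that this is not the procedure reported here: it substitutes the DerSimonian-Laird estimator of between-study variance for simplicity, whereas metafor fits random effects models by restricted maximum likelihood unless told otherwise. The function name and the input values are hypothetical.

```python
import math

def random_effects_dl(effects, variances):
    """Pool effect sizes with the DerSimonian-Laird random effects estimator.

    Returns the pooled effect, its 95% CI, and the estimated
    between-study variance (tau squared).
    """
    k = len(effects)
    w = [1 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random effects weights add tau squared to each sampling variance.
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), tau2

# Hypothetical Cohen's d values and sampling variances from three studies.
pooled, ci, tau2 = random_effects_dl([0.2, 0.6, 1.0], [0.04, 0.04, 0.04])
print(round(pooled, 3), tuple(round(x, 3) for x in ci), round(tau2, 3))
```

A pooled effect is treated as statistically significant when its 95% CI excludes zero, which is the criterion applied to the values in Table 4.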
Descriptive statistics on the effects of segmental-, prosody-, and fluency-based instruction on comprehensibility and accentedness are summarized in Table 5 and visually plotted in Figure 7. A between-group Q test confirmed that segmental, prosody, and fluency training differentially impacted L2 comprehensibility development and accentedness reduction at a p < .05 level, Qb(5) = 14.454, p = .013. According to the 95% CIs, the instructional treatments generally showed CIs above zero; the exceptions were segmental and prosody training for accentedness, where the lower end of the CI fell below zero (−0.246 and −0.012, respectively). Thus, the following patterns were suggested: (a) the effectiveness of segmental and prosody training was significant for improving comprehensibility but not accentedness and (b) fluency training improved both comprehensibility and accentedness.
Listener Backgrounds. The final analysis concerns the extent to which listener background (i.e., expert vs. L2 vs. mixed) affects the perceptions of L2 comprehensibility development and accent reduction following pronunciation instruction. The descriptive results are summarized in Table 5 and visually plotted in Figure 8. According to the results of a between-group Q test, the role of listener background was found to be significant, Qb(4) = 14.502, p = .005. Mixed listeners perceived changes in both comprehensibility and accentedness, as their 95% CIs excluded zero. Similarly, L2 listeners perceived a significant change in comprehensibility, as the lower end of the 95% CI was above zero. However, expert listeners may capture the impact of instruction on comprehensibility but not accentedness.

DISCUSSION
Global L2 English pronunciation proficiency has been extensively assessed using two inter-related but somewhat independent constructs: comprehensibility (ease of understanding) and accentedness (phonological nativelikeness). These constructs are typically operationalized through listeners' intuitive judgements. In Munro and Derwing's seminal work, the degree of comprehensibility and accentedness was assumed to be determined by both the phonological properties of speech (e.g., accuracy and fluency errors; linguistic factors) and by listener background (e.g., amount of prior ESL/EFL teaching, linguistics training; other listener factors). The current study sought to examine these assumptions by meta-analysing the comprehensibility judgements of L2 English speech in 37 listener studies and 17 intervention studies. The analysis generated insightful findings as to the product and process of L2 comprehensibility and accentedness judgements in response to the three research questions.

R1: Reliability of Comprehensibility and Accentedness Judgements
According to the results of the reliability analysis, relatively strong inter-rater agreement was found for both comprehensibility and accentedness judgements (.896, .909), regardless of listener background (expert vs. L2 vs. mixed). The findings suggest that listeners have similar intuitions of which L2 English pronunciation forms are comprehensible and native-like.

R2: Phonological Correlates of Comprehensible and Nativelike Pronunciation
In terms of listener behaviours during L2 speech assessments, the results showed that approximately 30-50% of the variance in comprehensibility and accentedness judgements could be accounted for by phonological factors. According to Plonsky and Oswald's (2014) benchmarks, these are relatively large effects. The results suggest that listeners rely heavily on phonological accuracy and fluency during their instant and intuitive judgements of L2 speech. However, what distinguishes comprehensibility and accentedness seems to be the type of phonological information listeners actually use when rating. Whereas L2 comprehensibility ratings were equally associated with the dimensions included in the analysis, L2 accentedness ratings were strongly linked to segmental accuracy in particular.
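As a rough illustration of how a correlation maps onto shared variance, the coefficients reported in Study 1 can be squared. This is a simple conversion offered under the assumption that variance accounted for is read as $r^2$; the exact calculation behind the 30-50% range is not stated in the text.

```latex
\begin{align*}
r^2_{\text{comprehensibility}} &= (.580)^2 \approx .336 \\
r^2_{\text{accentedness}} &= (.589)^2 \approx .347 \\
r^2_{\text{accentedness, segmentals}} &= (.792)^2 \approx .627
\end{align*}
```

On this reading, the overall coefficients correspond to roughly a third of the variance in each rating, while the segmental correlate of accentedness taken alone corresponds to considerably more.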
The meta-analysis of intervention studies further revealed how listeners' perception of comprehensible and native-like pronunciation changes when judging the speech of ESL students who had received segmental, prosody, and/or fluency training. The results showed that the listeners' judgements of comprehensibility were equally associated with the type of instruction received (e.g., segmental, prosody, and fluency training). Comparatively, improvement in accentedness did not appear to be perceptible even if students received segmental and prosodic accuracy training through explicit instruction and/or practice. Given that accentedness serves as a proxy for a relatively difficult aspect of L2 speech, namely the degree of phonological accuracy (see the results of Study 1), it is reasonable to assume that accentedness is resistant to change even when segmental and prosody-focused instruction is provided.
Given that the effectiveness of pronunciation teaching was found to be "small-to-medium" (e.g., Saito & Plonsky, 2019 for d = .73), the current study further demonstrated that such instructional effectiveness may vary according to different foci of assessment and training. Pronunciation teaching makes a "small-to-medium" difference when the focus of assessment highlights the comprehensibility of pronunciation (d = .61) and when the training focuses on fluency (d = .69). However, the effectiveness of training remains small or unclear if it is assessed for nativelikeness (d < .27) and if it targets the acquisition of segmental accuracy in particular (d = .12). On the whole, Studies 1 and 2 support the robustness, reliability, and replicability of the findings (i.e., the phonological correlates of comprehensible and native-like pronunciation) across two different types of investigations (cross-sectional vs. longitudinal).

R3: Roles of Listener Factors in Assessment and Teaching of Pronunciation
Finally, the findings showed that listener background has some differential effect on L2 comprehensibility and accentedness judgements. In Study 1 (listener studies), expert listeners seemed to rely more on phonological information (segmental accuracy in particular) than novice and mixed listeners, especially when assessing accentedness. The results of Study 2 extend this finding by suggesting that expert listeners consider the impact of instruction to be minor when assessing the accentedness of speakers who had received segmental and prosodic accuracy training. By contrast, such listener effects were not clearly observed in any context of L2 comprehensibility judgements.
The results imply two possibilities regarding the consequences of using different types of listeners (i.e., mixed listeners) in research on L2 comprehensibility and accentedness. For those who show less reliance on phonological information, non-phonological factors, such as vocabulary (Appel, Trofimovich, Saito, Isaacs, & Webb, 2019), grammar (Ruivivar & Collins, 2018), collocational (Saito, 2020), and discourse knowledge (Trofimovich & Isaacs, 2012) could have been responsible for explaining the remaining variance in ratings (especially when it comes to L2 comprehensibility; see Saito, Trofimovich, Isaacs, & Webb, 2016). Alternatively, it may be that the mechanisms underlying the L2 comprehensibility and accentedness judgements of mixed listeners are inconsistent because of their intricate, multi-layered backgrounds, and potentially random rating behaviour. In other words, mixed listeners may use different types of strategies to arrive at their comprehensibility and accentedness scores (see Nagle, Trofimovich, & Bergeron, 2019 for the use of Idiodynamic Software). This would be a fruitful area of inquiry for future studies (cf. Magne et al., 2019 for quantitative and qualitative analysis of rater behaviours during L2 speech assessments).

Revising Framework of L2 Pronunciation Assessment, Teaching, and Development
The findings of the study provide crucial implications for theory building in L2 pronunciation assessment, teaching, and development (as summarized in Table 6). First, they support Derwing and Munro's (2015) listener and speaker model of L2 comprehensibility and accentedness. The two global constructs of L2 pronunciation proficiency, comprehensibility and accentedness, are readily distinguishable, as listeners pay equal attention to various areas of phonological information for the former (segmentals = prosody = fluency) and prioritize segmental accuracy for the latter (segmentals > prosody > fluency). Second, listener effects are more clearly observed in accentedness than comprehensibility because expert listeners are more likely to be sensitive to the nativelikeness of segmental accuracy. The relationship between listener factors and comprehensibility may need to be examined beyond the focus of the current meta-analysis (i.e., phonological dimensions).⁵ Third, as conceptualized by the Interaction Hypothesis (e.g., Mackey, 2012) and shown in empirical research (e.g., Derwing & Munro, 2013; Saito, 2015), improvement tends to occur in the comprehensibility rather than nativelikeness dimensions of language. This asymmetry can be explained by the finding that comprehensibility improves as a collective effort of segmental, prosody, and fluency development, and that accentedness is resistant to change because its main component, segmental accuracy, is subject to gradual, extensive, and individually different learning patterns (for longitudinal evidence, see Saito, Suzuki, Oyama, & Akiyama, 2020; for more detailed accounts of this topic, see Flege & Bohn, 2020).

Implications for Practitioners
The different phonological correlates of L2 comprehensibility and accentedness identified here have several implications for pronunciation teaching and learning. On the one hand, given that many L2 learners are concerned with sounding native-like, teachers should focus on improving their segmental accuracy, the primary correlate of accentedness (Study 1). However, it is important to remind learners of the mounting evidence that few adult L2 learners can become nativelike in their pronunciation (e.g., Flege et al., 1995), as well as to make them aware that accentedness is likely to remain unchanged even after instruction (Study 2). This is because the refinement of L2 segmental accuracy is a slow, gradual, and extensive process, especially beyond the initial stages of learning (Flege et al., 1995; Saito, 2015). Furthermore, it may be subject to the influence of myriad individual differences including motivation (e.g., Moyer, 1999 for professional orientation and commitment; Nagle, 2018 for strong visions of future images), perceptual acuity (e.g., Saito, Kachlicka, et al., 2020), and cognitive functioning (Darcy, Park, & Yang, 2015 for working memory).
On the other hand, to help students improve the comprehensibility of their speech, teachers should introduce a range of practice activities which help learners improve the various dimensions of their L2 proficiency in a balanced manner. The focus of such activities can include segmentals (e.g., Munro & Derwing, 2006 for segmental contrasts with high functional load), prosody (Couper, 2006 for word stress; Saito & Saito, 2017 for intonation), and fluency (Suzuki, 2020; Thai & Boers, 2016; Tran & Saito, in press for timed repetition). In naturalistic L2 speech learning, there is ample cross-sectional and longitudinal evidence that (a) learners can quickly improve the fluency of their speech shortly after starting immersion and (b) L2 prosody will steadily develop as long as learners have access to ample interaction and immersion experience (Trofimovich & Baker, 2006). In fact, the current meta-analysis suggests that the impact of instruction can be clearly observed when speech is assessed in terms of comprehensibility (rather than accentedness).

Implications for Researchers
The current investigation took a first step towards meta-analysing the factors which are most relevant to the assessment and teaching of L2 pronunciation. Specifically, the studies focused on how different phonological dimensions of speech affect judgements of comprehensibility and accentedness, with listeners' backgrounds (expert vs. L2 vs. mixed) as a moderator variable. In light of the significance of the findings, and to provide implications for researchers in particular, I would like to end this paper by proposing the following topics worthy of future meta-analyses.
Intelligibility. While comprehensibility indexes listeners' ease of understanding, there is a consensus that what is ultimately important for communicative success is intelligibility, that is, interlocutors' actual understanding of the intended message. Although intelligibility is well researched in the field of L2 pronunciation, scholars have continued to debate how best to measure the construct. A diverse range of methods have been used, including transcription, comprehension questions, scalar ratings, and reaction time instruments (for reviews on methodological fuzziness in L2 intelligibility research, see Isaacs, 2008; Kang, Thomson, & Moran, 2018; Munro & Derwing, 2011). Additionally, the existing literature has exclusively relied on audio information as a main source of understanding, although some studies have begun to examine whether, to what degree, and how audio and visual information differentially impact L2 intelligibility (e.g., Drijvers & Özyürek, 2019; Wheeler & Saito, forthcoming).
In a broad sense, comprehensibility is conceptually similar to intelligibility (relative to accentedness), as it was originally referred to as "native speakers' perception of intelligibility" (Derwing & Munro, 1997, p. 2). On a narrower level, comprehensibility is methodologically distinguishable from intelligibility, as it taps into the actual effort made to understand (measured via scalar ratings), as opposed to the actual outcome of understanding (assessed via a wide range of measures, such as transcription and comprehension questions). While I make a strong call for future meta-analysis studies to further pursue the mechanisms underlying comprehensibility and intelligibility, it also needs to be emphasized that the latter construct should be surveyed with much caution. It may be advisable to wait for more empirically robust methods to be established and for more primary studies using them to be published.
Task. One topic that future meta-analysis studies should explore concerns the conditions of different speaking tasks. Crowther and his colleagues have begun to demonstrate that the phonological correlates of L2 comprehensibility and accentedness may vary according to task structure (Crowther, Trofimovich, Isaacs, & Saito, 2015 for simple vs. complex) and formality (Crowther, Trofimovich, Saito, & Isaacs, 2018 for academic vs. non-academic). Another distinction relates to controlled vs. spontaneous tasks (Saito & Plonsky, 2019) and structured vs. unstructured tasks (Saito & Liu, 2021). With a sufficient number of primary studies for a robust moderator analysis, future studies can examine the effects of task structure on L2 comprehensibility and accentedness as per any existing task frameworks in the task-based language learning literature (e.g., Robinson, 2011).
Fluency. One interesting finding of the current meta-analysis is that fluency training appears to be equally important for the assessment and development of comprehensible and native-like speech. The existing literature suggests that fluency factors (speech rate, pause frequency) explain a large degree of variance in listeners' judgements of comprehensibility (Suzuki & Kormos, 2020) and accentedness (Trofimovich & Baker, 2006), and that quick, perceptible improvement can be observed in the fluency (rather than accuracy) dimension of L2 speech in both naturalistic and classroom settings (Mora & Valls-Ferrer, 2012; Saito & Hanzawa, 2018). The argument here echoes a growing amount of theoretical discussion that fluency serves as a crucial component of speaking proficiency (see Foster, 2020 for a comprehensive overview). It would be intriguing for future studies to elaborate on more research-based approaches to fluency instruction, measure its impact on both comprehensibility and accentedness (cf. Suzuki, 2020; Thai & Boers, 2016; Tran & Saito, in press), and promote practitioners' awareness of the relative importance of fluency (over accuracy) as a component of speaking proficiency (Tavakoli, 2020).

ACKNOWLEDGMENTS
I am grateful to Yui Suzukida for her assistance with the data. I would like to thank three anonymous TESOL Quarterly reviewers for their helpful input, feedback, and advice at every stage of the manuscript writing and revising processes. The project is funded by a Leverhulme Trust Research Grant (RPG-2019-039).

THE AUTHOR
Kazuya Saito is an Associate Professor in Applied Linguistics at University College London, UK. His research interests include how second language learners develop various dimensions of their speech in naturalistic settings; and how instruction can help optimize such learning processes in classroom contexts.