Does Mode of Input Affect How Second Language Learners Create Form–Meaning Connections and Pronounce Second Language Words?

This study examined how mode of input affects the learning of pronunciation and form–meaning connection of second language (L2) words. Seventy-five Japanese learners of English were randomly assigned to 1 of 3 conditions (reading while listening, reading only, listening only), studied 40 low-frequency words while viewing their corresponding pictures, and completed a picture-naming test 3 times (before, immediately, and about 6 days after treatment). The elicited speech samples were assessed for form–meaning connection (spoken form recall) and pronunciation accuracy (accentedness, comprehensibility). Results showed that the reading-while-listening group recalled a significantly greater number of spoken word forms than did the listening-only group. Learners in the reading-while-listening and listening-only modes were judged to be less accented and more comprehensible compared to learners in the reading-only mode. However, only learners receiving spoken input without orthographic support retained more target-like (less accented) pronunciation compared to learners receiving only written input. Furthermore, sound–spelling consistency of words significantly moderated the degree to which different learning modes impacted pronunciation learning. Taken together, the findings suggest that simultaneous presentation of written and spoken forms is optimal for the development of form–meaning connection and comprehensibility of novel words but that provision of only spoken input may be beneficial for the attainment of target-like accent.

Learners encountering the written forms of second language (L2) words tend to acquire more vocabulary than learners encountering their spoken forms (Brown et al., 2008; Vidal, 2011). However, mounting evidence reveals the value of spoken input when it is used as an additional mode of input to support reading (Brown et al., 2008; Bürki, 2010; Webb & Chang, 2012). Learners pick up more words from reading texts with auditory support than from reading without such support (Malone, 2018; Webb & Chang, 2012). The benefit of bimodal input has also been corroborated by findings that learners tend to acquire more words through watching L2 television with captions than without captions (Perez et al., 2013). However, earlier studies have not focused on L2 learners' productive knowledge (i.e., pronunciation), for the most part using written measures of form-meaning connection, such as choosing first language (L1) translations corresponding to the L2 orthographic word forms provided, and targeting receptive knowledge of spoken forms (i.e., recognition). This is surprising because pronunciation is considered an important aspect of word knowledge (Nation, 2013) and essential for successful oral communication (Derwing & Munro, 2015). Lack of attention to pronunciation and the overuse of written measures may underestimate the value of encountering words in speech. Critically, despite a few studies measuring L2 pronunciation learning (e.g., Bürki, 2010), no research has examined how mode of input affects how comprehensibly (i.e., in a way that is easy for listeners to understand) L2 learners produce novel words.
Because L2 speakers can be sufficiently comprehensible despite having a noticeable foreign accent (Munro & Derwing, 1995) and because increasing comprehensibility is an appropriate goal of pronunciation teaching in globalized contexts (Levis, 2005), it is important to ensure that learners can produce the spoken forms of L2 words and that the produced forms are sufficiently comprehensible to the listener so that L2 speakers are successful in oral communication. Therefore, the present study aimed to examine the value of spoken input for developing two aspects of L2 learners' vocabulary knowledge: pronunciation (measured through comprehensibility and accentedness) and form-meaning connection (measured through spoken form recall) by comparing three input conditions: reading while listening (RWL), listening only (LO), and reading only (RO).

WRITTEN AND SPOKEN INPUT AND L2 VOCABULARY LEARNING
Research on incidental vocabulary learning has documented that learning occurs as a by-product of exposure to written input, such as reading short sentences (Webb, 2007) and reading graded readers (Brown et al., 2008). Researchers often measure learning gains in terms of form-meaning connection by asking learners whether they can recognize and recall word meanings and forms in written format. Studies have demonstrated that vocabulary learning also occurs from exposure to spoken input (e.g., listening to TV interviews and lectures; van Zeeland & Schmitt, 2013), yet learners appear to acquire more words from written input than from spoken input (Brown et al., 2008; Hatami, 2017; Vidal, 2011). For example, Vidal (2011) assigned first-year university students studying English as a foreign language (EFL) to two groups, either listening to academic lectures or reading the transcribed texts. Participants were tested in written format before, immediately after, and 1 month after the treatment using a lexical developmental scale assessing knowledge ranging from partial (i.e., form recognition) to full competence (i.e., the ability to use the word in a sentence). Vidal concluded that reading was a more efficient source of input for vocabulary learning than listening, particularly for low-proficiency learners, who might have experienced difficulty segmenting connected speech for text processing and comprehension. However, the limited benefit of spoken input has been documented primarily through research focusing on word knowledge in terms of form-meaning connection in written format (Brown et al., 2008; Vidal, 2011) or recognition of spoken forms, for instance, via a multiple-choice test (Hatami, 2017).
Spoken input is considered useful for vocabulary learning as an additional mode to support reading (Brown et al., 2008; Bürki, 2010; Bürki et al., 2019; Malone, 2018; Webb & Chang, 2012). Brown et al. (2008) compared vocabulary learning in three modes (RWL, RO, and LO) with Japanese university students studying three graded readers. Different levels of form-meaning connection of target words were measured using tests of meaning recognition (a multiple-choice test) and meaning recall (an L2-to-L1 translation test) in written format. Participants showed the greatest learning gains in all test formats in the RWL condition, followed by the RO and then LO conditions. Studies following up on Brown et al. support the advantage of RWL over RO, targeting participants of different L2 proficiencies (e.g., L2 beginners: Webb & Chang, 2012) and using different test formats (e.g., form recognition: Malone, 2018; collocation recognition: Webb & Chang, 2020). Unlike previous studies focusing on contextualized learning, Bisson et al. (2014) had L2 beginners engage with a decontextualized activity (a letter-search task) where their attention was drawn to spellings of target words while receiving exposure to the spoken forms and pictorial information. In that study, as few as two exposures to the multimodal stimuli led to significant improvement of translation recognition accuracy (see also Krepel et al., 2021, for more evidence supporting multimodal effects on L2 vocabulary learning through a decontextualized activity using translation practice). The attested advantage of multimodal input over unimodal input aligns with the cognitive theory of multimedia learning (Mayer, 2014). According to this model, presenting simultaneous modalities (e.g., written and spoken modes) leads to greater learning outcomes, such that success in learning depends on how multiple sensory systems are employed to integrate both verbal and nonverbal information into coherent mental representations (Niegeman & Heidig, 2012).
As reviewed previously, most work on incidental vocabulary learning tends to measure receptive knowledge (e.g., Brown et al., 2008; Vidal, 2011; Webb & Chang, 2020), with few studies examining how mode of input affects productive knowledge of spoken forms (i.e., pronunciation). This is reasonable given that incidental learning research is most concerned with documenting the amount of learning as a result of engagement with contextualized and meaning-focused activities where the main focus is on communicative content (Uchihara et al., 2019). Because the lack of explicit or deliberate focus on L2 words stands little chance of substantially improving productive knowledge, which is more challenging to acquire than receptive knowledge (Laufer & Goldstein, 2004), productive knowledge is rarely measured as a focus of incidental vocabulary research. As rare exceptions, two studies involving decontextualized and word-focused activities compared RWL with unimodal conditions (either RO or LO), measuring learning gains with production tests of spoken form and form-meaning connection (Bürki, 2010; Bürki et al., 2019). Bürki (2010) compared the effectiveness of an audio-supported paired-associate learning approach (i.e., combination of written and spoken input) with that of a traditional paired-associate learning approach (i.e., only written input) in the learning of multiple aspects of word knowledge including form-meaning connection (L1-to-L2 form recall in written mode) and pronunciation (with productions of words elicited via a word-reading task assessed for lexical stress and segmental accuracy). L1 Korean participants studying L2 English words in the audio-supported condition showed a significantly higher rate of written form recall and pronunciation accuracy in comparison to those in the RO condition. On the other hand, Bürki et al. (2019) compared RWL with LO conditions, in which L1 French learners studied English-like pseudowords in a paired-associate format while viewing the meanings conveyed through corresponding pictorial information. Spoken form recall was measured through a picture-naming test, and accuracy of vowel production was assessed with acoustic analysis and listener judgment. The learners exposed to L2 orthographic input recalled significantly more spoken forms than those receiving only spoken input, but the learners with orthographic support substituted significantly more L1 sounds erroneously in vowel production. Audrey Bürki et al. concluded that exposure to written forms facilitates form-meaning mapping but leads to non-target-like pronunciation. It should be noted, however, that Andreas Bürki elicited production of L2 words using a controlled task (i.e., word reading), making it unclear whether learners could accurately pronounce L2 words spontaneously (without reading them). Also, neither study adopted global measures of L2 pronunciation, such as word intelligibility or comprehensibility. Given that the first hurdle that learners need to overcome is to become understandable to listeners (Levis, 2005), assessing the degree of listener comprehension of L2 speech would increase the ecological validity of the pronunciation measures.

ORTHOGRAPHIC INFLUENCE IN L2 PRONUNCIATION LEARNING
Studies investigating the role of orthographic input in L2 phonological learning have produced inconsistent findings (for a review, see Bassetti, 2008; Hayes-Harb & Barrios, 2021), suggesting that orthography can have both positive (Erdener & Burnham, 2005; Solier et al., 2019) and negative effects (Bassetti & Atkinson, 2015; Bürki et al., 2019). These mixed findings could be due to the degree to which L1 and L2 orthographic systems overlap with or deviate from each other. For example, Bürki et al. (2019) attributed their finding of the negative influence of orthography to the incongruencies of the grapheme-to-phoneme conversion rules between L1 French and L2 English. Participants saw orthographies involving "i" and "o" which can be pronounced in English as /ɪ/ (e.g., "pick," "kick," "sick") and /ɑ/ (e.g., "log," "hot," "cod"), respectively; however, /ɪ/ is absent and "o" never corresponds to /ɑ/ in French. Thus, learners provided with orthographic support may have relied on their L1 orthographic system and substituted L1 vowels for their L2 counterparts, resulting in orthography-induced, non-target-like pronunciations (Bassetti, 2008). Similar negative effects of orthography on L2 pronunciation due to L1-L2 incongruency in orthographic systems have been documented in the studies of L1 Italian speakers learning L2 English words containing double letters such as "Finnish" versus "finish" (Cerni et al., 2019), L1 English speakers learning L2 German-like words containing final voiced obstruents such as "Steid" versus "Steit" (Hayes-Harb et al., 2018), and L1 English speakers learning L2 Spanish words containing a range of incongruent segments such as "ll" pronounced as /j/ in Spanish (Rafat, 2016).
Another factor concerns the extent to which an orthographic system deviates from one-to-one grapheme-to-phoneme correspondences, or orthographic depth, which is conceptualized on a transparent-to-opaque continuum. Spanish is a good example of a transparent language, with the exception of a few letters (i.e., "v," "b," "c," and "ll") that can correspond to two phonemes. In contrast, English has a rather opaque language system with many instances of graphemes corresponding to two or more phonemes such as "i" as /ɪ/ (e.g., "pick"), /i/ (e.g., "taxi"), and /aɪ/ (e.g., "kite"). It is hypothesized that L1 users of phonologically transparent writing systems rely on L2 orthographic input more than L1 users of phonologically opaque writing systems (Bassetti, 2008). Erdener and Burnham (2005) tested and supported this hypothesis, investigating whether two groups of participants speaking L1 Turkish (transparent) and L1 English (opaque) could accurately repeat L2 words in two target languages: Spanish (transparent) and Irish (opaque). All learners pronounced target words more accurately when they viewed a written representation of the words. However, the benefit of orthographic input was greater for L1 Turkish speakers than L1 English speakers in production of L2 Spanish words, probably because L1 users of the phonologically transparent language could make better use of L2 orthographic input in processing L2 auditory input. In contrast, when repeating L2 Irish words, L1 Turkish speakers were negatively impacted by the orthographic representation, while L1 English speakers were not.

GLOBAL CONSTRUCTS OF PRONUNCIATION: COMPREHENSIBILITY AND ACCENTEDNESS
Since Munro and Derwing's (1995) seminal study, several global constructs, including comprehensibility and accentedness, have been widely researched in L2 pronunciation studies (Derwing & Munro, 2015). Comprehensibility refers to listeners' perceived ease or difficulty in understanding L2 speech, and accentedness is defined as listeners' judgments of how different L2 speech sounds from the expected language variety. These two constructs are typically measured through listeners' ratings of speakers, using numerical point scales (e.g., 1 = easy to understand, 9 = hard to understand; 1 = no accent, 9 = heavily accented). A possible issue regarding the validity of global measures of pronunciation (particularly accentedness) is the question of who should be evaluating L2 speech. In research, the target variety of a given language as spoken by L1 speakers is often set as a reference point. However, this "native variety" frequently varies among speakers across different contexts, so researchers recruit listeners, with their specific experiential profiles, to represent a particular reference point (e.g., as trained L1-speaking raters assessing learners' speech or as naïve interlocutors interacting with learners). For instance, to provide a reference point for accentedness, researchers might sample listeners with homogeneous backgrounds, such as those who come from the same regional variety (e.g., British or American English) and/or those who have knowledge of learners' L1. The extent to which various listener characteristics impact accentedness ratings has been researched but remains inconclusive (Saito, 2021). One way to confirm rating reliability is to examine interrater consistency to determine that it is sufficiently high for research purposes.
Although listeners normally do not receive any detailed instruction as to the rating procedure except brief descriptions of the target constructs (e.g., accentedness, comprehensibility), high interrater consistency has been achieved regardless of various differences across listeners, for instance, in their L1 versus L2 status or prior training in linguistics and phonetics (e.g., see Saito, 2021, for a review).
Since Munro and Derwing (1995), researchers have measured L2 accentedness and comprehensibility with the aim of tracking the development of L2 pronunciation proficiency in naturalistic environments (e.g., Derwing & Munro, 2013; Saito, 2015) and instructional settings (e.g., Nagle, 2018; Saito & Hanzawa, 2018). The accumulated evidence suggests that comprehensibility improves with increased L2 exposure and intensive instruction, whereas attaining target-like pronunciation remains a difficult task (Saito, 2021). L2 learners could improve in their production of some acoustic features determining segmental and suprasegmental accuracy, but other features are more difficult to improve (e.g., labial, alveolar, and pharyngeal constrictions) despite continued immersion or instruction (Trofimovich & Baker, 2006). For example, Japanese speakers initially aim to attain the comprehensible and intelligible form of a North American variety of English /r/ by using interlanguage strategies (tongue retraction, phonemic lengthening). After much immersion experience, some learners appear to approximate more target-like English /r/ as they learn to produce novel acoustic (F3) and articulatory configurations, including labial, alveolar, and pharyngeal constrictions (Saito & Brajot, 2013). Reliance on a limited repertoire of acoustic features might be sufficient to render L2 speech comprehensible but not necessarily target-like.
While a small improvement of a learner's accent can be expected with practice, estimated through a recent meta-analysis to be a relatively small effect (Cohen's d = .28) following instruction (Saito, 2021), sounding nonaccented might require a substantial amount of immersion experience (Flege & Fletcher, 1992; Trofimovich & Baker, 2006), an earlier age of onset (Flege et al., 2006), strong motivation (Moyer, 2014), and special language learning abilities (see Suzukida, 2021, for a review), including phonemic coding (e.g., Hu et al., 2013) and perceptual acuity (e.g., Saito et al., 2020).
In light of these empirical findings, many scholars have emphasized the importance of assessing L2 speech in terms of both L2 comprehensibility and accentedness, as these measures might distinguish three different developmental stages of adult L2 speech learning (Derwing & Munro, 2013;Saito, 2021). At the onset of development, L2 speakers initially produce weakly specified (heavily accented) forms requiring considerable effort for listeners to understand (low comprehensibility). Through more exposure and ample conversational opportunities, L2 speakers might achieve communicatively adequate production (comprehensible but accented speech). Ultimately, some L2 speakers can master the phonological detail characteristic of advanced proficiency (comprehensible and target-like speech). Using this developmental account of L2 speech learning, the current study set out to explore how training can help learners acquire not only comprehensible pronunciation forms of target L2 words (which would correspond to the initial-to-mid stage of L2 speech learning) but also more target-like, refined, and advanced pronunciations of L2 words (corresponding to the mid-to-final stages of L2 speech learning).

MOTIVATION FOR THE CURRENT STUDY
There are several reasons why more research is needed to investigate the effects of mode of input on L2 vocabulary acquisition. First, our understanding of the value of spoken input is largely based on the findings from research on incidental vocabulary learning that measured vocabulary learning in written form. In order to advance our insights into how and when spoken input facilitates (or inhibits) vocabulary learning, more research needs to employ different instructional approaches (e.g., explicit focus on L2 words vs. incidental learning) and measure other aspects of word knowledge beyond the written mode of form-meaning connection.
Second, little is known about how mode of input affects productive knowledge of spoken forms (i.e., pronunciation). Although two studies investigated input modality using measures of pronunciation and form-meaning connection (RWL vs. RO in Bürki, 2010; RWL vs. LO in Bürki et al., 2019), neither of them compared RO versus LO. In order to determine the true value of spoken input, it is necessary to compare all three modality types at one time and examine the relative contribution of the three modes to vocabulary learning.
Third, previous studies used pronunciation measures focusing on target-like accuracy (e.g., a forced-choice identification task by L1 listeners) and provided little insight into the degree to which listeners understand L2 speech. Given that instructed L2 speech learning is a multifaceted phenomenon that needs to be examined from multiple angles, it is important to include both comprehensibility (as a fundamental and achievable goal) and accentedness (as a specialized and advanced-level goal; Derwing & Munro, 2015).
Finally, investigation of the extent to which (in)congruencies between spellings and sounds affect pronunciation acquisition was limited to segmental features (e.g., vowels and consonants), and studies have yet to examine the effects of sound-spelling consistency at the word level. Therefore, the present study, which adopted a decontextualized and deliberate learning procedure (i.e., paired-associate word learning) and measured spoken forms of L2 words, was guided by the following research questions:
Based on the cognitive theory of multimedia learning (Mayer, 2014) and findings of earlier studies (Bürki, 2010; Bürki et al., 2019; Malone, 2018; Webb & Chang, 2012), the RWL mode was predicted to facilitate learners' development of form-meaning connections to a greater degree than the RO or LO mode. This is because exposure to multimodal input (audio and orthographic) can help L2 learners access greater linguistic resources from different angles, resulting in deeper processing and greater acquisition of new words. For pronunciation measures, learners receiving spoken input (RWL and LO) were expected to perform better than those receiving only written input (RO). However, the predicted superiority of the spoken input modes over the RO mode might be gradually reduced as sound-spelling consistency of target words increases. Conversely, an additional mode of input might place demands on learners' limited cognitive capacity and result in a negative impact for multimodal input on learning, particularly in the present study where participants received only one exposure to each target word (Baddeley, 1986). Last, different effects of mode of input might arise for different pronunciation measures.
Learners receiving written input might sound more heavily accented than learners receiving only spoken input because the availability of orthographic information triggers grapheme-to-phoneme recoding applying L2 and L1 conversion rules, so that learners' production of L2 words is influenced by their L1 (Bürki et al., 2019). Because L1 influence might be more detrimental for listener judgments of accentedness than comprehensibility, a negative effect of orthography might be reduced for comprehensibility compared to accentedness, particularly as sound-spelling consistency of words increases.

Overview of the Study
The study adopted a pretest-posttest design with three experimental groups (RO, LO, and RWL) and three testing trials (pretest, immediate posttest, and delayed posttest). Participants were randomly assigned to three experimental groups, which encountered target words in different modes of input: RO, LO, and RWL. During the treatment, participants learned 40 English words through seeing and/or hearing the words while viewing their corresponding pictures. A picture-naming test was administered at the three testing times, and the elicited samples were evaluated for form-meaning connection and pronunciation measures.

Participants
Seventy-nine Japanese university EFL students in Japan participated in this experiment. Four participants were excluded from the analysis, because three had lived abroad for an extended period of time (5-12 years) and one did not complete the delayed posttest. The remaining 75 participants (M age = 19.5, range = 18-24) had studied English for a minimum of 6 years in instructional settings. All participants except one had scored 90% or higher on the 1,000-word level of the Vocabulary Levels Test (Webb et al., 2017), and all except one had scored 80% or higher on the 2,000-word level of the test. Their mean score at the 2,000-word level was 28.76, indicating that they had receptive knowledge of almost all of the most frequent 2,000 words. The 75 participants were randomly assigned to three experimental groups: RO (n = 25), LO (n = 25), and RWL (n = 25). There was no between-group difference in vocabulary test scores, F(2, 72) = .70, p = .503, ηp² = .02. All participants reported normal hearing.
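The balanced random assignment described above can be illustrated with a minimal sketch. It is not the authors' procedure, which the text does not specify beyond "randomly assigned"; the function name `assign_groups`, the integer participant IDs, and the fixed seed are all hypothetical choices made for illustration.

```python
import random


def assign_groups(participant_ids, groups=("RO", "LO", "RWL"), seed=None):
    """Shuffle participants and deal them into equal-sized groups.

    Hypothetical helper: the source only reports that 75 participants
    were randomly assigned to three groups of 25.
    """
    ids = list(participant_ids)
    if len(ids) % len(groups) != 0:
        raise ValueError("participants cannot be split evenly across groups")
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    rng.shuffle(ids)
    size = len(ids) // len(groups)
    return {g: ids[i * size:(i + 1) * size] for i, g in enumerate(groups)}


# 75 retained participants, labeled 1..75 for illustration
assignment = assign_groups(range(1, 76), seed=0)
group_sizes = {group: len(members) for group, members in assignment.items()}
```

Shuffling once and slicing guarantees equal group sizes, which simple per-participant coin flips would not.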

Target Items
Forty target words were selected according to three criteria. First, a pool of low-frequency words was created by collecting English words that were beyond the most frequent 5,000 word families in Nation's (2017) British National Corpus and Corpus of Contemporary American English (BNC/COCA) word lists. Second, because the treatment involved learning written and spoken forms attached to meanings conveyed in visual images (pictures), only concrete nouns were selected as target items. Third, words that could be replaced with high-frequency synonyms were avoided to reduce the possibility that high-frequency synonyms of the target items would be produced in the picture-naming test. The selected items were measured in terms of sound-spelling consistency (i.e., the degree to which the pronunciation of a word matches its spelling). Using consistency norms for English words developed by Chee et al. (2020), a feedforward (i.e., spelling-to-sound) rime consistency score was calculated for each target word. This score accounts for the frequencies of similarly spelled words for a given pronunciation (e.g., "-oar" can be regarded as consistent because many words containing this rime, such as "soar," "boar," and "hoar," are pronounced similarly). To illustrate using the consistency scores from this dataset, "toupee" (.128) is less consistent than "spatula" (.476) or "parakeet" (.525). Scores for three words (i.e., "abalone," "loquat," "maracas") were not available, and these words were therefore not included in the consistency analysis (see Table 1 for target items and consistency scores).
Each of the 40 target words was recorded twice by a female Canadian speaker of English from Ontario using a TASCAM DR-05 audio recorder and digitized into a wav format at a sampling rate of 44.1 kHz (16-bit resolution). The better of the two productions was selected according to clarity, naturalness, and lack of background noise and then stored as an individual sound file, with peak intensity normalized using digital speech-analysis software (Praat; Boersma & Weenink, 2014). The stimuli were clear and comprehensible based on the judgment of another L1 English speaker. Pilot testing showed that two L1 English speakers successfully identified all 40 productions recorded by the model speaker.

Treatment and Testing
Paired-associate vocabulary learning was implemented as the learning intervention for three reasons: It allowed for careful control of the presentation of the target items, it has been found to positively contribute to learning the written forms of words, and it has been used frequently in studies of vocabulary learning (Nation & Webb, 2011). The learning and testing schedule was programmed with PsychoPy (Peirce, 2007). Before the treatment began, participants put on headphones equipped with a microphone (AT810 Cardioid headset microphone) and familiarized themselves with the vocabulary learning task by working through three practice examples. During the treatment, participants encountered the meanings of the target words conveyed in visual images (i.e., copyright-free pictures retrieved from the Internet, standardized to a size of 400 × 400 pixels) while seeing and/or hearing the target word forms. For each target item, the picture was displayed on the computer screen for 4 seconds. For the conditions involving spoken input (LO and RWL), the auditory presentation of the target word began 750 milliseconds after the picture appeared. For the conditions receiving written input (RO and RWL), the orthographic presentation of the target word appeared under the corresponding picture for 4 seconds. A 2-second blank interval was inserted between trials.
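The trial timing just described can be summarized as a small schedule, sketched here in plain Python rather than PsychoPy so that no toolkit calls are assumed. The function names and event labels are hypothetical. Two points are assumptions not stated in the text: the written form is taken to appear at the same moment as the picture, and the audio duration is left unspecified because it depends on each recording.

```python
PICTURE_S = 4.0       # picture display duration (from the text)
TEXT_S = 4.0          # written form display duration (from the text)
AUDIO_ONSET_S = 0.75  # audio starts 750 ms after picture onset (from the text)
BLANK_S = 2.0         # blank interval between trials (from the text)


def trial_events(mode):
    """Return (onset_s, duration_s, label) tuples for one learning trial.

    mode is "RO", "LO", or "RWL". A duration of None means the value
    is not given in the source (audio length varies by word).
    """
    if mode not in {"RO", "LO", "RWL"}:
        raise ValueError(f"unknown mode: {mode}")
    events = [(0.0, PICTURE_S, "picture")]
    if mode in {"RO", "RWL"}:
        # Assumption: text onset coincides with picture onset.
        events.append((0.0, TEXT_S, "written form"))
    if mode in {"LO", "RWL"}:
        events.append((AUDIO_ONSET_S, None, "audio"))
    events.append((PICTURE_S, BLANK_S, "blank"))
    return events


def trial_duration():
    """Length of one trial including the blank interval, in seconds."""
    return PICTURE_S + BLANK_S


rwl_labels = [label for _, _, label in trial_events("RWL")]
```

Under these timings every trial has the same length in all three conditions, so the groups differ only in which word forms accompany the picture.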
During the treatment, the 40 target items were presented in a sequence of eight blocks of five items. The experimental groups received exposure to each of the 40 target items once in one of three different modes of input (LO, RO, and RWL). For all groups, the order of item presentation was randomized across participants. Immediately after the final exposure to each block of five items, a picture-naming test was administered. In the picture-naming test, participants were presented with the same pictures that were presented during the learning trial and asked to orally produce twice each of the words corresponding to the pictures shown on the computer screen. If participants did not remember a word, they were instructed to move to the next item. Their speech was recorded with a TASCAM DR-05 audio recorder and digitized into a wav format at a sampling rate of 44.1 kHz with 16-bit resolution. One out of two productions per word (i.e., a speech sample without fillers or self-corrections during articulation) was selected and stored in an individual sound file, with peak intensity normalized using Praat (Boersma & Weenink, 2014). The same test procedure (except for exposure treatment) was adopted for both pretest and delayed posttest. Prior to data collection, issues with clarity of visual stimuli, trial procedures, and testing procedures were resolved through a pilot study with 20 university students with a similar learning background. Data for pilot study participants were not included in the main data analysis (visual stimuli are available in Online Supporting Information A).
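The block structure above (eight blocks of five items, each block followed immediately by a picture-naming test) can be sketched as follows. This is an illustration under assumptions, not the authors' PsychoPy script: `make_blocks` and `session_plan` are hypothetical names, items are represented by integers, and testing the items in the same order as they were studied within a block is an assumption, since the text does not report the within-block test order.

```python
import random


def make_blocks(items, block_size=5, seed=None):
    """Shuffle the items for one participant and cut them into blocks."""
    order = list(items)
    random.Random(seed).shuffle(order)  # per-participant randomization
    return [order[i:i + block_size] for i in range(0, len(order), block_size)]


def session_plan(items, block_size=5, seed=None):
    """Interleave study and test phases: study a block, then name its pictures."""
    plan = []
    for block in make_blocks(items, block_size, seed):
        plan.extend(("study", item) for item in block)
        # Assumption: test order mirrors study order within a block.
        plan.extend(("test", item) for item in block)
    return plan


blocks = make_blocks(range(40), seed=1)   # 40 target items, indexed 0..39
plan = session_plan(range(40), seed=1)
```

Passing a different seed for each participant would give each one a different item order, in line with the randomization across participants described above.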

Procedure
The experiment was conducted over two sessions on two different days. On Day 1, participants took the pretest, completed the treatment and an immediate posttest, and took the Vocabulary Levels Test. On Day 2, approximately 6 days (M = 6.1, SD = 3.6) after the first session, participants completed a surprise delayed posttest and filled out language background questionnaires. The test format (i.e., picture naming) across three time points was the same, except that 10 high-frequency items were added to the pretest to boost motivation. The 10 high-frequency items were not included in the analyses. Participants were asked to learn the English words and forewarned that they would be asked to produce words in response to pictures immediately after the learning trials. Participants in the RO condition were told that they would see the spellings of words without any auditory information presented. Participants in the RWL condition were told that they would see and hear target words simultaneously. The treatment and tests were conducted individually with the researcher or a research assistant. All speech samples were recorded in a sound-attenuated booth. A total of 4,061 speech samples were elicited from 75 speakers on the pretest, immediate posttest, and delayed posttest and evaluated for form-meaning connection and pronunciation measures.

Form-Meaning Connection and Pronunciation Measures
To assess form-meaning connection, spoken form recall (e.g., production of accurate forms of words in a picture-naming test) was measured. Form recall is considered the most difficult measure of form-meaning knowledge compared to three other measures: form recognition, meaning recognition, and meaning recall (Laufer & Goldstein, 2004). For pronunciation measures, following Derwing & Munro (2015), two constructs were measured: accentedness (i.e., listener rating of the extent to which learners' word productions deviated from an L1 variety of the target language) and comprehensibility (i.e., listener rating of the degree of effort needed to comprehend learners' word productions).
To measure three aspects of word knowledge (spoken form recall, accentedness, comprehensibility), six L1 speakers of Canadian English from Ontario (three females, three males) were recruited to participate in a series of rating sessions. Speakers of Canadian English were chosen because they represented the variety of English that was readily accessible to the researchers and that provided a reasonable L1-speaker benchmark for assessing the pronunciation of L2 learners. Three of the six speakers had language teaching experience in EFL and English-as-a-second language (ESL) contexts. All six speakers had no hearing problems and were highly familiar with Japanese-accented English (M = 5.1, range = 4-6 in response to 1 = not familiar at all, 6 = very familiar). Raters completed a word listening task programmed using PsychoPy (Peirce, 2007). In this task, raters first listened to each of the speech samples and pressed an "f" key for correct and a "j" key for incorrect word pronunciation. Pronunciation was considered correct if it was sufficiently intelligible with minor errors or foreign accents present (Kang et al., 2013). Raters were first presented with 40 target words produced by the model speaker as a reference point and asked to evaluate whether L2 speech samples were intelligible to an average speaker of their variety of English. For some of the words having multiple variant pronunciations (e.g., /baɪnάkjʊlərz/ and /bɪnάkjʊlərz/ for "binoculars"), the one that was produced by the model speaker was considered as the expected target sound. Form recall was coded dichotomously with 1 point assigned to responses judged as correct by all six raters and 0 points to responses judged as incorrect by one or more raters or missing responses (i.e., failure to name pictures).
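The dichotomous coding rule for form recall can be expressed compactly. The sketch below is illustrative only: the function name `code_form_recall` and the representation of a missing response as `None` are assumptions, not part of the original rating software.

```python
def code_form_recall(judgments, n_raters=6):
    """Return 1 only if every rater judged the production correct, else 0.

    judgments: a sequence of booleans (True = judged correct), one per rater,
               or None when the learner failed to name the picture.
    """
    if judgments is None:              # missing response counts as 0
        return 0
    if len(judgments) != n_raters:
        raise ValueError(f"expected {n_raters} judgments, got {len(judgments)}")
    return int(all(judgments))         # unanimity is required for 1 point


print(code_form_recall([True] * 6))             # unanimous "correct"
print(code_form_recall([True] * 5 + [False]))   # one dissenting rater
print(code_form_recall(None))                   # picture not named
```

Because a single "incorrect" judgment is enough to yield 0, this rule is conservative: a score of 1 means the production met the intelligibility criterion for all six listeners.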
Second, for samples judged as correct, listeners rated accentedness (1 = not accented at all, 5 = heavily accented) and comprehensibility (1 = easy to understand, 5 = hard to understand). The 5-point numerical scale was adopted because, in contrast to earlier studies measuring L2 speech at sentence or discourse levels (e.g., through a 9-point scale in Munro & Derwing, 1995), this study focused on words as unit-of-speech samples. Given the relatively limited amount of linguistic information available at the word level, using a large number of scale points might make the rating task excessively challenging or even confusing. Also, for intuitive L2 speech judgments of this kind, rating performance using a 5-point scale could be as reliable as when a 9-point scale is used (Isaacs & Thomson, 2013). A pilot study also confirmed that the choice of a 5-point rating scale was appropriate for rating word pronunciation in this study. Prior to main rating sessions, raters first familiarized themselves with the 40 target words and rating procedure through completing a practice listening task with 50 items (not included for analysis in this study). They then listened to each of the speech samples from the main dataset, completed a binary rating task (correct vs. incorrect), and rated accentedness and comprehensibility for items they had judged as correct. Raters were presented with 41 blocks of 100 samples and a final block of 41 samples. These samples consisted of random selection of pretest, immediate posttest, and delayed posttest items, as well as L1 speakers' samples (included as distracter items), totaling 4,141 items (4,061 from Japanese speakers + 80 from English speakers). The inclusion of the English speaker samples also allowed us to confirm the reliability of raters' performance. Recordings were played only once. In the first meeting with the researcher, the raters first practiced rating 50 samples and then rated the first block of 100 samples. 
Raters subsequently evaluated the remaining samples in their own time.
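The block structure of the rating sessions follows from the sample counts reported above (4,061 learner samples plus 80 L1-speaker samples). The sketch below checks that arithmetic; the function name `rating_blocks`, the sample labels, and the use of a single global shuffle are assumptions made for illustration, since the source states only that the samples were a random selection across test times and speaker groups.

```python
import random


def rating_blocks(n_learner, n_l1, block_size=100, seed=None):
    """Pool learner and L1-speaker samples, shuffle them, and cut them into blocks."""
    samples = [("L2", i) for i in range(n_learner)] + [("L1", i) for i in range(n_l1)]
    rng = random.Random(seed)
    rng.shuffle(samples)               # mix test times and distracter items together
    return [samples[i:i + block_size] for i in range(0, len(samples), block_size)]


blocks = rating_blocks(n_learner=4061, n_l1=80, seed=0)
sizes = [len(b) for b in blocks]
print(len(blocks), sizes[0], sizes[-1], sum(sizes))
```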

Data Analysis
Preliminary analysis of raters' responses to English speaker samples showed that raters consistently judged the English-speaker baseline as correct (100% accuracy), the least accented (94% of samples were rated as 1 = not accented at all), and the easiest to understand (99% of samples were rated as 1 = easy to understand). The interrater reliability for accentedness (α = .75) and comprehensibility (α = .72) was satisfactory for research purposes (α > .70; Larson-Hall, 2010). These preliminary results confirmed the reliability of the raters' performance and their understanding of the tasks. In response to the first RQ, data of form recall (1 = correct, 0 = incorrect) were analyzed in a generalized linear mixed-effects model with a binomial distribution (Jaeger, 2008). The fixed factors included (dummy-coded) mode of input (LO, RO, RWL), (dummy-coded) time (pretest, immediate posttest, delayed posttest), and the interaction term. We included random intercepts for participant (75 levels), word (40 levels), and rater (6 levels), a by-word random slope for the mode-of-input factor, and the correlation between the slope and the intercept. Before conducting analyses to answer the second and third RQs, accentedness and comprehensibility ratings were calculated only for responses to the target items that learners did not recall at pretest but recalled after treatment, such that pronunciation scores reflected the development of the spoken forms of unfamiliar words. The resulting data points (or observations) for accentedness and comprehensibility were 10,434 cases (1,739 × 6 raters). Data of accentedness and comprehensibility were analyzed in a mixed-effects model. The fixed factors included (dummy-coded) mode of input (LO, RO, RWL), (dummy-coded) time (immediate and delayed posttests), (grand-mean centered) sound-spelling consistency, and all of the interactions between them (stepwise model comparison was not adopted here). 
We included random intercepts for participant (75 levels), word (40 levels), and rater (6 levels), a by-word random slope for the mode-of-input factor, a by-participant random slope for the consistency factor, and the correlations between the slopes and the intercepts.
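The interrater reliability coefficients reported in the preliminary analysis (α = .75 and α = .72) can be illustrated with a short computation. The sketch assumes that α refers to Cronbach's alpha with raters treated as items, which the source does not state explicitly; the function name `cronbach_alpha` and the small ratings matrix are invented for illustration and are not data from the study.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a samples-by-raters matrix.

    ratings: list of rows, one row per speech sample, one column per rater.
    """
    n_raters = len(ratings[0])
    if n_raters < 2 or len(ratings) < 2:
        raise ValueError("need at least two raters and two samples")
    # Variance of each rater's scores across samples.
    rater_vars = [variance([row[j] for row in ratings]) for j in range(n_raters)]
    # Variance of the summed score per sample.
    total_var = variance([sum(row) for row in ratings])
    if total_var == 0:
        raise ValueError("total score variance is zero; alpha is undefined")
    return n_raters / (n_raters - 1) * (1 - sum(rater_vars) / total_var)


# Hypothetical 1-5 accentedness ratings from three raters on five samples.
example = [
    [1, 1, 2],
    [2, 2, 2],
    [3, 2, 3],
    [4, 3, 4],
    [5, 4, 5],
]
print(round(cronbach_alpha(example), 3))
```

Alpha approaches 1 when raters order the samples in the same way, which is the sense in which values above .70 are treated as satisfactory here.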

RESULTS
Descriptive statistics of spoken form recall, accentedness, and comprehensibility are presented in Table 2. Changes in scores for spoken form recall, accentedness, and comprehensibility between different test time points are illustrated in Figures 1-3. Full results of mixed-effects modeling conducted to answer RQs (i.e., variance components, R-squared, random parameter correlations, and model fit indices) can be found in Online Supporting Information B.
For comprehensibility ratings at immediate posttest, learners receiving spoken input or both spoken and written input simultaneously were perceived as significantly more comprehensible than those receiving written input only, LO versus RO: β = .36, SE = .12, t = 3.05, d = .86, p = .003; RWL versus RO: β = .33, SE = .10, t = 3.27, d = .92, p = .002. There was no significant difference between the LO and RWL groups, β = .03, SE = .10, t = .35, d = .10, p = .73. At delayed posttest, no significant differences were observed among the three groups, although there was a trend for RWL to be perceived as more comprehensible than RO, LO versus RO: β = .27, SE = .17, t = 1.63, d = .46, p = .109; LO versus RWL: β = .04, SE = .14, t = .27, d = .08, p = .788; RWL versus RO: β = .23, SE = .13, t = 1.75, d = .49, p = .087. Sound-spelling consistency of words was in general negatively associated with pronunciation measures (accentedness, β = −.64, SE = .24, t = −2.70, p = .001; comprehensibility, β = −1.11, SE = .32, t = −3.46, p = .001), indicating that productions of consistent words tended to be perceived as more target-like and comprehensible than those of inconsistent words. However, Figures 4-7 illustrate that the effect of consistency appeared to vary across experimental groups and testing times.
At immediate posttest for both accentedness and comprehensibility, the strength of the relationship between consistency and pronunciation measures was significantly different between the LO and RO groups as well as between the LO and RWL groups, accentedness, RO versus LO: β = −.88, SE = .38, t = −2.33, p = .024; RWL versus LO: β = −.81, SE = .31, t = −2.57, p = .014; comprehensibility, RO versus LO: β = −1.43, SE = .52, t = −2.73, p = .009; RWL versus LO: β = −1.08, SE = .41, t = −2.64, p = .011, indicating that the extent to which productions of consistent words became more target-like and comprehensible was greater for learners receiving written input (RO and RWL) than those receiving spoken input only (LO). Such an effect was not found when two groups (RWL and RO) receiving written input were compared either for accentedness, β = .08, SE = .32, t = .24, p = .814, or comprehensibility, β = .
Finally, a follow-up analysis was conducted to examine whether learners in the written-input conditions (RO and RWL) could perform better than those in the spoken-input-only condition (LO) when learning words that are highly consistent in sound-spelling correspondence. The target words with consistency scores (i.e., 37 items) were sequenced in order of consistency and organized into three categories: 12 low-consistency words, 13 mid-consistency words, and 12 high-consistency words. A mixed-effects modeling analysis was conducted on the set of 12 high-consistency words. The analysis showed a pattern similar to that found in the original analysis with all target words included.
Although no significant between-groups differences were found for the delayed posttest result, the LO and RWL groups significantly outperformed RO at immediate posttest for accentedness, RO versus LO: β = .40, SE = .12, t = 3.37, p = .002; RO versus RWL: β = .44, SE = .15, t = 3.03, p = .006, and comprehensibility, RO versus LO: β = .36, SE = .15, t = 2.44, p = .022; RO versus RWL: β = .38, SE = .17, t = 2.19, p = .041. No significant differences were found between the LO and RWL groups for accentedness, β = .04, SE = .10, t = .38, p = .709, or comprehensibility, β = .02, SE = .10, t = .16, p = .872. These findings together indicated that the effect of sound-spelling consistency was larger for RO and RWL compared to LO, yet learning gains in accentedness and comprehensibility were larger for RWL and LO in comparison to RO; additionally, gains for RWL were comparable to gains for LO, regardless of the degree to which target words were consistent in sound-spelling correspondence.
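The grouping of the 37 scored target words into low-, mid-, and high-consistency sets for the follow-up analysis can be sketched as a rank-based split. The group sizes (12, 13, 12) come from the text; the function name `split_by_consistency`, the handling of ties by sort order, and the consistency scores themselves are hypothetical.

```python
def split_by_consistency(scores, sizes=(12, 13, 12)):
    """Rank words by consistency score and cut them into low, mid, and high groups.

    scores: dict mapping each word to its sound-spelling consistency score.
    sizes:  number of words in the (low, mid, high) groups.
    """
    if sum(sizes) != len(scores):
        raise ValueError("group sizes must add up to the number of scored words")
    ranked = sorted(scores, key=scores.get)    # ascending: least consistent first
    low_n, mid_n, _ = sizes
    low = ranked[:low_n]
    mid = ranked[low_n:low_n + mid_n]
    high = ranked[low_n + mid_n:]
    return low, mid, high


# Hypothetical scores for 37 placeholder words (not the study's items or values).
scores = {f"word_{i:02d}": i / 36 for i in range(37)}
low, mid, high = split_by_consistency(scores)
print(len(low), len(mid), len(high))
```

Only the `high` group would then enter the follow-up model, mirroring the analysis restricted to the 12 high-consistency words.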

DISCUSSION
Overall results showed that learners in the RWL group recalled a significantly larger number of spoken word forms than learners in the LO group, which aligned with earlier research findings showing that reading with auditory support is an effective way to build form-meaning connection for L2 words (Brown et al., 2008; Bürki et al., 2019; Krepel et al., 2021; Malone, 2018; Webb & Chang, 2012). Also, learners in the RWL and LO groups produced L2 words in a manner that listeners perceived to be less accented and more comprehensible compared to words produced by learners assigned to the RO mode. Although our understanding of the value of spoken input might have been biased by the findings of earlier studies measuring vocabulary gains in written format, the current study confirmed the important role of spoken input when pronunciation of novel words was measured. Furthermore, sound-spelling consistency of words had a significantly larger impact on accentedness and comprehensibility in the RWL and RO conditions compared to the LO condition. However, no difference was found between RWL and RO, indicating that learners in the RWL mode processed orthographic information to the same extent as did learners in the RO mode. The superiority of RWL in (a) processing orthographic input (vs. LO) and (b) enhancing form-meaning connection (vs. LO) and pronunciation (vs. RO) reveals that learners could process and benefit from two modes of input presented simultaneously at one exposure without being excessively impacted by a potential increase in cognitive workload. These findings help reaffirm the pedagogical value of RWL with the goal of enhancing multifaceted aspects of word knowledge including form-meaning connection and pronunciation. Further interpretation and discussion of the results follow in response to each of the three RQs.
In answer to the first RQ, at the immediate posttest, significantly larger gains in recall of spoken forms were observed in RWL (53%) than LO (48%), but no significant difference was found between RWL and RO (48%) or between LO and RO. At the delayed posttest, no significant differences were found across the three groups, but a similar pattern emerged with RWL (36%) leading to the greatest gains, while LO (33%) led to the smallest gains. The larger gains for RWL align with earlier research findings (Brown et al., 2008; Bürki, 2010; Bürki et al., 2019; Krepel et al., 2021; Malone, 2018; Webb & Chang, 2012). Despite the statistically significant results, these bimodal effects were relatively moderate in magnitude when considered from a practical standpoint (RWL group learners gained 5%, or 1.6 words, more than the LO group on the immediate posttest). No clear advantage of RWL over RO or of RO over LO appeared to contrast with previous findings of contextualized vocabulary learning: RWL outperforms RO (Webb & Chang, 2012) and RO outperforms LO (Vidal, 2011). Because the target items were presented in isolation in this study, auditory support most likely did not help participants either divide written texts into meaningful chunks of language (Webb & Chang, 2012) or segment connected speech (Vidal, 2011) for improving text processing and comprehension. Furthermore, the lack of significant differences between RO and LO was not expected and merits further discussion. It was predicted that compared to the LO condition, learners in the RO condition would show smaller gains due to the absence of congruency between the input and test modes (Jelani & Boers, 2018).
These findings suggest that learners encountering the written forms of words might perform as well as learners receiving spoken input in building form-meaning connections of L2 words and in achieving a minimum-threshold accuracy of the phonological forms according to the criteria set in this study (i.e., judged as acceptable by all six L1 listeners).
In answer to the second RQ, the findings that the RWL and LO groups outperformed the RO group for accentedness and comprehensibility with medium-to-large effects (d = .86-1.19) at the immediate posttest suggest that encountering spoken input is beneficial for the development of productive knowledge of spoken forms. The absence of a significant difference between RWL and LO at the immediate and delayed posttests indicated that the orthographic representation did not help learners produce L2 words in a more target-like or comprehensible manner. One possible reason for this is the crosslinguistic influence of orthographic depth in participants' L1 (Japanese) and L2 (English; Erdener & Burnham, 2005). Although Japanese is not an alphabetic language, Japanese L1 speakers use the L1 romanization system (i.e., Roomaji) to represent L2 English, which is considered phonologically transparent. For example, "o" corresponds to a single phoneme /o/ in Japanese but can be pronounced differently in English, such as /ə/ (e.g., "computer"), /ɑ/ (e.g., "hot"), and /oʊ/ (e.g., "token"). In this example, the orthographic information presented in RWL and RO might negatively affect L2 pronunciation accuracy because learners tend to apply L1 grapheme-phoneme conversion rules to recoding L2 orthographic forms into L2 sounds. This may result in spoken production involving segmental errors, such as substituting L1 sounds (e.g., "toboggan": /təbɑɡən/→*/tobogən/) and devoicing L2 consonants (e.g., "chisel": /tʃízəl/→*/tʃísəl/). However, unlike previous studies focusing on segments (Bürki et al., 2019), this study did not show any significantly negative effect of L2 orthography on L2 pronunciation. Perhaps segmental errors resulting from erroneous recoding of L2 written to spoken forms were compensated by accurate pronunciation of the remaining parts, so that the errors might not have a significantly negative impact on the listener judgment of the whole word.
Approximately 1 week after the treatment, the advantage of spoken input over written input for accentedness rating was retained for LO versus RO, β = .35, p = .010, but not for RWL versus RO, β = .20, p = .071, suggesting that orthographic input might prevent learners from reducing the degree of foreign accent in the long term. Given that speech is transient and orthography is permanent, learners in the RWL mode might still have access to target-like phonological forms of L2 words in their memory immediately after the treatment, allowing them to produce the spoken forms more accurately than learners in the RO mode. However, at the delayed posttest, since the visual orthographic trace of the word remains accessible longer than the phonological information (Solier et al., 2019), learners might have relied on the orthographic representation to recode the written forms into L2 sounds using L1 (and L2) grapheme-phoneme conversion rules. As a result, the recalled spoken forms might have been as heavily accented as the forms elicited from learners receiving only written input. For comprehensibility rating, the benefit of spoken input was not durable either for RWL versus RO, β = .23, p = .087, or LO versus RO, β = .27, p = .109. Crosslinguistic influence from L1 grapheme-phoneme conversion might have less of an impact on how easily or effortlessly spoken words are understood regardless of the presence or absence of foreign accent.
In answer to the third RQ, words of higher sound-spelling consistency were in general perceived to be less accented and more comprehensible than words of lower consistency, but the extent to which consistency of words impacted listener judgments differed across groups and test times. The results of the immediate posttest showed stronger consistency effects for RWL and RO compared to LO for accentedness and comprehensibility, aligning with our prediction, because RWL and RO were the only conditions where learners were exposed to the spellings of words, which likely triggered orthographic recoding. However, we did not expect the consistency effect to become stronger in the LO condition from the immediate to delayed posttest, as evidenced by the finding that the significant between-group differences in the effect of consistency initially observed at the immediate posttest disappeared at the delayed posttest. This result likely occurred because immediately after the treatment, participants in the LO mode could produce spoken word forms (whether consistent or inconsistent) because the phonological representation of the words remained available in their working memory (considering that the knowledge of words was tested after each block of five items). However, given the transient nature of auditory information, success in recall of L2 forms at the delayed posttest might have been largely dependent on the orthographic representations of the words, developed as a result of the phonology-to-orthography recoding at the exposure phase. The recoding process might have been executed more easily and successfully for words of higher consistency, therefore enabling learners to be more accurate at pronouncing consistent words than inconsistent words. This explanation is speculative and the role of sound-spelling consistency in L2 pronunciation acquisition needs to be further investigated in future studies.
Finally, a follow-up analysis of the high-consistency words confirmed that RO yielded the smallest learning gains for pronunciation accuracy of the three modes regardless of the degree to which target items were consistent. This finding suggests that exposure to the written form alone may not be sufficient to improve pronunciation of L2 words, even when these words are highly consistent in their sound-spelling correspondence.

IMPLICATIONS, LIMITATIONS, AND FUTURE DIRECTIONS
The current study provides methodological and pedagogical implications for assessing and teaching L2 vocabulary. First, learners encountering the written forms of L2 words without the availability of spoken input could achieve a minimum-threshold level of production accuracy. However, the role of spoken input comes into play in further enhancing the form accuracy, such that newly learned spoken words might be perceived by listeners as less accented and easier to understand. Second, ideally, learners should be presented both the written and spoken forms of L2 words together so that knowledge of form-meaning connection and pronunciation can develop simultaneously. In many instructional contexts where spoken input is limited outside the classroom, learners tend to devote most of their time to studying the written forms of words, for example, through reading written texts intensively, using flashcards and word lists, and writing the spellings of words repeatedly. It is important for language teachers to ensure that learners are exposed to the spoken forms of words by teaching strategies, such as encouraging learners to listen to the pronunciation of unfamiliar words when looking them up in online dictionaries; choosing vocabulary exercise books or textbooks that include audio support; using vocabulary learning applications that have the function to present the spoken forms of words; watching L2 television, movies, and video clips with captioning options available (e.g., YouTube); and listening to other audio materials (e.g., songs, podcasts, radio). Third, the superiority of spoken input over written input persisted even for words that are highly consistent in their sound-spelling correspondence, suggesting that if the primary goal of L2 instruction is to enhance the spoken forms of words, spoken input always needs to be introduced even when the pronunciations of new words are easily inferred from the spellings of words.
Finally, exposure to spoken input without orthographic support helps L2 speakers develop target-like pronunciation of words in the long term, as evidenced by the finding that it was only the LO condition that maintained its advantage over the RO condition for accentedness. Learners' full attention may need to be drawn to phonological details without being distracted by the presence of orthographic representation if the pedagogical focus is on accent reduction. Given that many scholars have emphasized the importance of setting a realistic goal, such as the development of comprehensible rather than target-like pronunciation forms (Derwing & Munro, 2015), we argue that RWL may be an optimal method for developing L2 oral skills relative to RO (typical of foreign language education) and LO (characteristic of naturalistic immersion). While written modality enables students to develop and reinforce stronger form-meaning mappings for new words (Vidal, 2011), audio modality can help students reach the minimum threshold for successful understanding (comprehensibility rather than accent reduction) in an efficient and effective manner (Derwing & Munro, 2015).
The present study has several limitations, which should be considered in future studies investigating how mode of input affects L2 word learning. First, participants received only one exposure to each of the target items in the treatment, and the inefficiency of learning was evident from a large drop between immediate and delayed posttest scores for spoken form recall. To increase the ecological validity of the research and improve learning, different numbers of repetitions should be explored in future studies. The findings of the current study might then be used as a baseline for comparison. Second, we urge caution with generalization of the findings because they are restricted to a specific population of learners (L1 Japanese, an orthographically transparent language) and target language (L2 English, an orthographically opaque language), and may not apply to other situations where, for example, learners' L1 is opaque (e.g., English) and their L2 is transparent (e.g., Spanish). Third, this study did not explore the influence of individual differences, such as L2 proficiency and auditory processing skills (Saito et al., 2020) on vocabulary learning. Although exploration of learner-related variables was beyond the scope of this study, individual differences play an important role in L2 pronunciation learning (for a review, see Suzukida, 2021). How learner-internal variables interact with mode of input needs to be investigated further. Fourth, findings of this study based on the learning of concrete words may not generalize to a situation where learners study abstract words because concrete and imageable meanings are easier to learn than abstract meanings (Ellis & Beaton, 1993). The extent to which varying degrees of imageability affect word pronunciation learning remains unknown, which is also worth investigating in the future. Finally, testing word knowledge via phonological form for the RO group, who received only written input, might not be ecologically valid. 
Essentially, participants in the RO condition were required to draw on grapheme-to-phoneme translations to articulate target words. In this study, the RO condition served as a baseline group, and the data were employed to explore the extent to which exposure to written forms, especially when words of higher consistency were targeted, could lead to comprehensible word pronunciation. This methodological decision was also pedagogically motivated given the fact that vocabulary learning tends to be implemented via written mode, especially in EFL contexts (Nation, 2013). Moreover, understanding the benefits and limitations of learning L2 words from written input alone provides useful pedagogical implications. Yet, to increase the ecological validity in light of learning-test congruency, future studies should also measure knowledge of written forms of words, which will allow researchers to evaluate both advantages and disadvantages of written input for multifaceted aspects of vocabulary learning more comprehensively.