Measuring global oral proficiency in SLA research: A new elicited imitation test of L2 Chinese



This article describes a new Chinese Elicited Imitation Test (EIT) and reports on a study that investigated the degree to which it functions as a tool that can be used in second language acquisition research to gauge global second language (L2) oral proficiency. Eighty L2 Chinese learners, sampled from two university curricular levels so as to represent high and low linguistic abilities and including both heritage and foreign language learners, participated in the study by completing the EIT as well as an oral narrative task and a background questionnaire. The results suggest that the new Chinese EIT can help measure overall oral linguistic proficiency in L2 Chinese for a variety of research purposes.


A hallmark of research, policy, and practice in the field of foreign language education is the existence of large-scale standardized proficiency tests that are designed to elicit encompassing evidence of the range of functional language abilities and communicative competencies in a target language. In the United States, the ACTFL Guidelines (ACTFL, 2012) offer the theoretical backdrop for a number of widely known and used standardized tests, such as the ACTFL Oral Proficiency Interview (OPI) and the ACTFL Oral Proficiency Interview Computer Test (OPIc). In Europe, the CEFR levels (Council of Europe, 2009) have become widespread for the testing of proficiency across educational and professional settings. In the context of standardized testing, the construct of proficiency is defined as abilities that underlie “what individuals can do with language in terms of speaking, writing, listening, and reading in real-world situations in a spontaneous and non-rehearsed context” (ACTFL, 2012, p. 3). The ACTFL standardized proficiency tests, in particular, are principled instruments that

compare a person's unrehearsed ability against a set of language descriptors—the ACTFL Proficiency Guidelines. The ACTFL Proficiency Guidelines describe proficiency along a continuum from the very top (full professional proficiency) of a scale to the very bottom (little or no functional ability).

(, n.p.)

ACTFL tests that are specifically designed to measure oral proficiency “evaluate speech that is either Interpersonal (interactive, two-way communication) or Presentational (one-way, non-interactive)” (ACTFL, 2012, p. 4), as do other oral speaking tests developed by researchers for in-house use (e.g., Jin & Mak, 2013). Assessment of a person's unrehearsed ability to speak a language in the real world, “regardless of where, when, or how the language was acquired” (ACTFL, 2012, p. 3), is of utmost importance for key purposes to which standardized proficiency testing is put in educational contexts, such as measuring student learning outcomes, evaluating program quality, or accrediting instructors (Byrnes, 2006; Tarone, 2013).

A different construct definition of proficiency, and the one that is the focus of the present article, can be found in the field of second language acquisition (SLA). In the SLA context, language proficiency is typically understood as learners' knowledge and automated ability for use of core vocabulary and grammar delivered with reasonably intelligible pronunciation and fluency, or what Hulstijn (2011, 2012) recently termed the core components of Basic Language Cognition. While this narrow definition of proficiency in SLA would clearly be insufficient to characterize full communicative oral proficiency in a language, it is sufficient for many research purposes in SLA, such as characterizing and grouping participants in terms of their second language (L2) proficiency or documenting and controlling participants' relative linguistic abilities within a study. In SLA research, language proficiency is oftentimes the most important moderating or secondary variable that needs to be taken into consideration when investigating another L2 phenomenon of central interest (Hulstijn, 2011, 2012; Norris & Ortega, 2012; Thomas, 1994, 2006). Thus, a pressing need in the field of SLA has always been the establishment of global oral L2 measures by which participants' basic linguistic proficiency can be validly and reliably gauged for a variety of research purposes. A well-known technique for the elicitation of language proficiency evidence in psycholinguistics, bilingualism, and child first language acquisition is elicited imitation (Slobin & Welsh, 1973). The technique requires participants to listen and then repeat as exactly as possible a number of sentences that either are kept at a certain constant length or become increasingly longer.

Aiming to meet the current needs of the field of SLA in general and of Chinese SLA in particular, the present study examined the degree to which a newly developed L2 Chinese Elicited Imitation Test (EIT) functioned as a global oral L2 measure of basic linguistic proficiency by comparing the performance of 80 participants who were purposefully sampled from two curricular levels, low and high. In addition, equal numbers of foreign and heritage Chinese learners were sampled within each level in order to examine the extent to which the new Chinese EIT would be useful across diverse learning backgrounds that are typical of many Chinese educational contexts. Finally, participants' EIT performances were compared against criterion measures of the linguistic quality of interlanguage production similar to those also traditionally used in SLA research (e.g., Housen, Kuiken, & Vedder, 2012). The goal was to investigate the value of the EIT as a tool that may meet a wide range of research needs for measuring oral global proficiency in the field of Chinese SLA.

Literature Review

The Quest for Proficiency Measurement Tools in SLA Research

The lack of agreement on how to best address the measurement needs of SLA researchers has been long recognized as a problem in the field, and the problematic nature of proficiency measurement practices in SLA research has been empirically documented.

In two syntheses conducted 12 years apart, Thomas (1994, 2006) inspected typical strategies for characterizing participants' proficiency across 157 SLA studies published in flagship SLA journals between 1988 and 1992 and again across a new set of 211 studies published between 2000 and 2004. She found that the most frequent means for researchers to characterize a study sample's proficiency was to declare that participants were sampled from an institutional context based on their assigned curricular levels, such as “fourth semester,” “second year,” “level five of six levels,” and so on. This approach to grouping students for research purposes occurred in 40% of the 1994 study set and in 33% of the 2006 set. Another common practice, used in 21% of the 157 studies and 19% of the 211 studies, was for researchers to simply use impressionistic descriptors, such as intermediate, false beginners, or advanced. Thomas termed these two strategies the institutional status and impressionistic definitions of L2 proficiency in SLA research and noted that they are insufficiently rigorous or informative to allow for valid interpretations within and across studies. More recently, Tremblay (2011) inspected 144 SLA journal articles published between 2000 and 2008 and lamented that only 37% of them measured participants' L2 proficiency independently using an assessment for which validity and reliability information had been previously established. In the remaining 63% of studies, researchers simply reported the institutional status of their participants; stated their length of instruction or length of residence in an L2 environment; or, at best, collected indirect information that may have been outdated or unreliable, such as self-reported pre-existing standardized proficiency scores or self-assessed proficiency (Tremblay, 2011, n.p.).

While some characterization of L2 proficiency in studies is better than none, the findings by Thomas (1994, 2006) and Tremblay (2011) point to a bleak picture for SLA research. Simply put, without a trustworthy and interpretable estimation of participants' proficiency that goes beyond indirect or subjective indicators, the internal and external validity of research can be called into question, because interpretations of main study findings are obscured and generalization to other contexts beyond a given study is problematic.

One may wonder why SLA researchers have so rarely made use of standardized tests, given that a large number are available for many target languages. However, in terms of practicality, administering standardized language tests such as these in a study could be both costly and time-consuming. In addition, in many cases the methodological reasons for measuring proficiency in studies are important but narrow—researchers need to evaluate participants' basic linguistic abilities, instead of the full communicative and functional competencies that frameworks such as ACTFL and CEFR are designed to measure.

This lack of valid and practical L2 proficiency assessment tools is especially problematic for L2 Chinese researchers because a tool that quickly and conveniently estimates basic linguistic ability or global oral proficiency is either not available or too impractical. Standardized Chinese proficiency tests, such as the Hanyu Shuiping Kaoshi (HSK) in the People's Republic of China or the Advanced Placement (AP) Chinese test in the United States, typically take two to three hours to complete, making them unwieldy for research purposes. Although the OPI and the OPIc are available for Mandarin Chinese and are less demanding of participants' time (up to 30 minutes), they are nevertheless costly because the tests must be administered and scored individually by certified interviewers and/or raters. Finally, defining the proficiency levels of participants in a study impressionistically or by reference to their institutional status may prove to be an even more pressing concern for L2 Chinese researchers, given the widely varying proficiency levels that often coexist within and across levels in a program and the heterogeneous language profiles associated with Chinese heritage vs. foreign language learning backgrounds (Ke & Li, 2011).

EIT in SLA Research

In SLA, elicited imitation has been used as a research tool throughout the last 40 years (see particularly, in chronological order, Naiman, 1974; Henning, 1983; Bley-Vroman & Chaudron, 1994; Vinther, 2002; Jessop, Suzuki, & Tomita, 2007; Cox & Davies, 2012). The technique of elicited imitation requires participants to listen, and then repeat as exactly as possible, a number of sentences that either are kept at a certain constant length or become increasingly longer. The theoretical rationale, laid out by Slobin and Welsh (1973) and repeated since then, rests with the idea that when the length of a sentence exceeds short-term memory capacity, participants will not be able to parrot it. Instead, they will need to comprehend and decode the sentence, recall and reconstruct it with their own grammar, and only then be able to reproduce it. In other words, repeating oral sentences is thought to tap into the ability to process meaningful language receptively and productively, to demand the integration of verbatim memory of sentences with syntactic and semantic knowledge of the language system from long-term memory, and to call also for the deployment of psychomotor skills needed for meaningful speech production in real time. Some empirical evidence in support of Slobin and Welsh's theoretical explanation for how elicited imitation works has been furnished with L2 English EIT data from two recent studies. Okura and Lonsdale (2012) found that the EIT performances from 40 participants correlated with their institutional level status but not with their working memory scores. Graham, McGhee, and Millard (2010) found that increasing sentence length, while not the sole source of difficulty, accounted for 73% of the variance in the EIT scores from 81 participants (p. 67).

EITs have been used to assess different dimensions that demand real-time, integrative oral linguistic skills, such as global L2 oral proficiency (e.g., Hameyer, 1980; Henning, 1983), L2 syntactic knowledge (e.g., Naiman, 1974; Ellis, 2005; Ellis, Loewen, & Erlam, 2006; Erlam, 2006; Schimke, 2011; Weitze & Lonsdale, 2011), morphology development (West, 2012), and even L2 listening comprehension (e.g., Jensen & Vinther, 2003; compare Cox & Davies, 2012, who reported a correlation of r = 0.74 between their EIT and a listening test). Some researchers have also deemed it necessary to go beyond direct repetition of stimuli and have introduced additional elements in the elicited imitation procedures. For example, Erlam (2006; also Ellis et al., 2006) designed an EIT composed of 68 statements that included both grammatical and ungrammatical English structures. Participants were instructed to answer an opinion question about whether a statement was true or false after hearing it once. Immediately thereafter, they were required to repeat the statement in correct English. The opinion question was inserted to ensure that the participants did not explicitly focus on the linguistic form of the statements. The ungrammatical sentence repetition was added to the design because the researcher reasoned that any repetition of the ungrammatical sentences that contained spontaneous corrections would provide a good indication of participants' implicit grammatical knowledge.

Variations in EIT focus and design notwithstanding, most existing studies, reviewed by Vinther (2002) and Jessop et al. (2007) as well as more recent empirical studies (see Cox & Davies, 2012), suggest that EITs offer a good estimate of learners' global oral L2 proficiency levels when following the traditional procedures of participants simply repeating oral stimuli upon hearing them, as recommended by Slobin and Welsh (1973) for first language acquisition.

What Kind of Proficiency Does an EIT Measure?

In the field of SLA, Hulstijn (2011) usefully theorized the construct of L2 proficiency as being composed of four interrelated dimensions. First, he posited, there is higher language cognition (HLC) and basic language cognition (BLC). HLC varies widely by individual and is attained by those well-educated language users—native or nonnative—who develop advanced competencies in specialized discourses and literate uses of language. BLC, on the other hand, does not vary greatly among native speakers and is recruited when a person uses oral language in any communicative, everyday situation. It includes:

(a) the largely implicit, unconscious knowledge in the domains of phonetics, prosody, phonology, morphology and syntax; (b) the largely explicit, conscious knowledge in the lexical domain (form-meaning mappings), in combination with (c) the automaticity with which these types of knowledge can be processed. BLC is restricted to frequent lexical items and frequent grammatical structures, that is, to lexical items and morphosyntactic structures that may occur in any communicative situation, common to all adult L1-ers, regardless of age, literacy, or educational level.

(p. 203, italics in original)

Second, Hulstijn distinguished between core and peripheral components of L2 proficiency within both HLC and BLC. Core components are the “more purely linguistic competences,” and peripheral components are “less purely linguistic competences, such as metacognitive and strategic competences” (p. 239, legend to Hulstijn's Figure 3). It is posited here that elicited imitation as a technique elicits L2 oral performances that tap the core proficiency components of BLC in Hulstijn's (2011) model of the construct.

Other testing techniques can also be used for measuring core components of BLC for research purposes in SLA. They include, for example, vocabulary size tests, grammaticality judgment tasks, or the family of cloze tests and c-tests, where full words or half-words, respectively, are deleted from a text and must be supplied by the learner. EIT offers certain advantages when compared to these techniques. For example, vocabulary size tests measure linguistic knowledge of a strictly noncommunicative kind, untimed grammaticality judgment tasks delivered via written stimuli likely rely on explicit knowledge, and cloze tests and c-tests require relatively competent reading literacy skills. The technique of elicited imitation, by contrast, taps implicit grammatical knowledge available for online use insofar as it demands repetition of full sentences under oral and time-limited conditions. In addition, it does not involve reading or writing; thus it can be used with beginning-level students and with learners from low-literacy backgrounds. Elicited imitation performances therefore provide insights into learners' knowledge and automated ability for use of core vocabulary and grammar delivered with reasonably intelligible pronunciation and fluency. While the dimensions of vocabulary, grammar, pronunciation, and fluency are insufficient in and of themselves to characterize full communicative oral proficiency in a language, they are known to closely affect scores on speaking tests (e.g., in L2 Chinese, see Jin & Mak, 2013).

In sum, then, the construct that elicited imitation as a technique intends to measure lies on a continuum between two extremes: on the one hand, maximally communicative proficiency in speaking and listening, such as the BLC and some HLC components, both core and peripheral, that are elicited by standardized ACTFL or CEFR proficiency tests; and on the other hand, linguistic skills in isolation, such as those elicited by entirely noncommunicative alternatives like a test of vocabulary size, a grammaticality judgment task, or a cloze test or c-test.

A New EIT for L2 Chinese

Ortega, Iwashita, Norris, and Rabie (2002) developed and tested parallel versions of the same EIT across four foreign languages (English, German, Japanese, and Spanish) with the original goal of creating a set of EITs that could serve to support cross-linguistic research in SLA. They found that the EIT performances consistently yielded high reliability coefficients and successfully reflected differences in each language between groups of participants sampled from lower-division courses and upper-division courses in the respective foreign language curricula studied. Recently, Tracy-Ventura, McManus, Norris, and Ortega (2013) designed and tested a French version of the EIT with a sample of study abroad students and reported good construct validity. This accumulated research suggests that the EIT is a promising tool for use by SLA researchers, as the continuous scores yielded by EIT performances can serve a number of research needs, such as characterizing and grouping participants by linguistic skills or statistically controlling and accounting for pre-existing differences among participants (Hulstijn, 2012; Norris & Ortega, 2012; Thomas, 2006; Tremblay, 2011).

Motivated by the encouraging findings reported for the parallel EITs for English, German, Japanese, and Spanish by Ortega et al. (2002), Zhou and Wu (2009) designed a Mandarin version of the English EIT. The Chinese EIT version closely followed the test specifications set by Ortega et al. (2002) and was pilot-tested with 23 L2 Chinese learners sampled from low (n = 11) and high (n = 12) curricular levels (Zhou & Wu, 2009). The participants for the pilot were recruited from institutions other than the one where the present study was conducted. An independent-samples t test showed that there was a significant difference in scores for the high (m = 71.33, sd = 16.7) and low (m = 27.36, sd = 14.17) groups; t(21) = 6.78, p < .001, with a very large effect size of Cohen's d = 2.84 (2009, n.p.). This suggested that the Chinese EIT successfully discriminated between groups posited to belong to widely different proficiency levels according to their institutional status.
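The pilot statistics above can be checked from the reported summary values alone. The following is a minimal sketch, assuming a pooled-variance (Student's) t test and Cohen's d computed with the pooled standard deviation; the source does not state which variants Zhou and Wu (2009) used, and the function names are illustrative only. Under these assumptions the t value rounds to the reported 6.78 and d comes out at roughly 2.83, close to the reported 2.84; the small gap is plausibly due to rounding in the published means and standard deviations.

```python
import math

def pooled_sd(n1, sd1, n2, sd2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def t_and_d(m1, sd1, n1, m2, sd2, n2):
    """Student's t (equal variances assumed), Cohen's d, and degrees of freedom."""
    sp = pooled_sd(n1, sd1, n2, sd2)
    t = (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    d = (m1 - m2) / sp
    return t, d, n1 + n2 - 2

# Summary statistics as reported for the pilot: high group (n = 12), low group (n = 11).
t, d, df = t_and_d(71.33, 16.7, 12, 27.36, 14.17, 11)
print(f"t({df}) = {t:.2f}, d = {d:.2f}")
```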

As is customary practice in language assessment (Brown, 2005), the quality of the Zhou and Wu (2009) test and its items was examined with the pilot data. In terms of the quality of the test as a whole, Cronbach's alpha was chosen as a well-known index of reliability that estimates the internal consistency of a set of test scores on a norm-referenced test whose items are scored polytomously. The logic and uses of Cronbach's alpha are similar to those of the also widely employed Kuder-Richardson formulas 20 and 21; the latter are appropriate only for dichotomous yes/no or 0/1 scoring, whereas Cronbach's alpha is used when polytomous scoring is involved, as in the present study. The resulting value of .97 was much higher than the standard of acceptable reliability, which is typically set at .70 (for more details, see Allen & Yen, 2002). This suggested that the 30 items in the instrument produced good across-individual variance and trustworthy scores, and thus the sentences in it held good potential for eliciting reasonably consistent performances to allow for empirical inferences about the construct of interest. In terms of the quality of the test items, a traditional item difficulty and item discrimination analysis (Brown, 2005) was conducted, and some minor item revisions were made as needed.
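For readers less familiar with the index, Cronbach's alpha for k polytomously scored items is k/(k − 1) multiplied by one minus the ratio of the summed item variances to the variance of the total scores. The sketch below implements that standard formula; the function name and the small score matrix are invented for illustration and are not the pilot data.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix.

    scores: list of rows, one per examinee; each row lists that
    examinee's item scores (e.g., 0-4 per EIT sentence).
    """
    k = len(scores[0])
    # Sample variance of each item across examinees.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of examinees' total scores.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 4 examinees x 3 items, scored 0-4.
toy_scores = [[4, 3, 4], [2, 2, 1], [0, 1, 0], [3, 3, 4]]
print(round(cronbach_alpha(toy_scores), 2))
```

When every item ranks examinees identically, the summed item variances are as small as possible relative to the total-score variance and alpha reaches 1; values near the .97 reported above therefore indicate highly consistent item behavior.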

Following the process of development, piloting, and fine-tuning just described, the EIT was then used by both S.-L. Wu (2011b) and Zhou (2012) in their doctoral dissertation studies, which focused on the expression of motion events in L2 Mandarin and on the relationship between “willingness to communicate” and Mandarin L2 proficiency, respectively. In the present study, the elicited imitation data originally collected from a large sample of Chinese learners of varying curricular levels and learning backgrounds by the first author (S.-L. Wu, 2011b) were reanalyzed, and additional evidence was sought in order to ascertain whether the new L2 Chinese EIT could be used with a success similar to that reported for the parallel English, French, German, Japanese, and Spanish EITs (Ortega et al., 2002; Tracy-Ventura et al., 2013).

Differences Between Heritage and Foreign Language Learning

The organization and sequencing of L2 Chinese classes in postsecondary settings in the United States, as well as in other countries, often results in heritage learners and learners of Chinese as a foreign language being placed within the same mixed classes. This practice was true of the curricular context in which the present study was conducted. Yet studies have shown that heritage language learners (HLLs), as a result of their early home or community exposure to the language and culture through parents and other family members, exhibit different learning profiles from foreign language learners (FLLs) (see Au & Romo, 1997; Comanaru & Noels, 2009; Kim, 2001; Kondo-Brown, 2005; McGinnis, 1996; Wen, 2011). For instance, HLLs were found to be more confident in their listening ability and to have more heterogeneous competencies in speaking, reading, and writing skills than their FLL counterparts (e.g., Kim, 2001; McGinnis, 1996). In terms of grammatical ability, S.-L. Wu (2011a) observed that HLLs outperformed FLLs in their ability to incorporate the hither/thither perspectives when using Chinese directional complements such as pǎo jìn-lái “run into-hither” and bān shàng-qù “move up-thither.” It was therefore of interest for the present study to not only explore whether the Chinese EIT could discriminate between the two curricular levels from which participants had been sampled, but also to examine how well it could reflect differences between HLLs and FLLs within each of the two levels.

Criterion Indicators of Global Oral L2 Proficiency: CAF Measures

This study also provided a comparison of the EIT data with evidence gleaned from interlanguage production measures frequently employed in SLA research. The goal was to better determine the extent to which elicited imitation performances yielded reliable and valid evidence that correlated well with an L2 learner's strengths and weaknesses in different oral linguistic dimensions such as those tapped by the chosen interlanguage measures.

The complexity, accuracy, and fluency qualities exhibited in L2 learners' oral production (Ellis, 2003, 2008; Skehan, 1998) are often quantified into a single measurement that is then taken to be an index of global linguistic performance levels in the field of SLA; the acronym CAF has begun to gain popularity to refer to this kind of measurement (see Housen et al., 2012; Norris & Ortega, 2009). Complexity is commonly characterized as the elaborateness, richness, and diversity of the L2 linguistic system as deployed in oral production, and it includes aspects such as lexical, syntactic, or propositional complexity (e.g., Ellis, 2003; Ellis & Barkhuizen, 2005; Ortega, 2003). Accuracy is the degree of conformity to the L2 norms, which is usually measured by comparison to target-like use (Pallotti, 2009; Wolfe-Quintero, Inagaki, & Kim, 1998). Fluency refers to the native-like speed and smoothness of speech or writing (Lennon, 1990; Skehan, 2009). Norris and Ortega (2009) suggested that the three traits of complexity, accuracy, and fluency (CAF) are dynamic and interrelated and that not all indexes of CAF have an equal predictive value for different levels of language ability. Focusing on only one aspect of CAF may lead the researcher to miss important information that would more fully allow the researcher to understand how L2 CAF develops. In addition, Robinson, Cadierno, and Shirai (2009) argued that researchers may achieve better precision in the characterization of interlanguage development if they strive to measure precise aspects of linguistic complexity or accuracy that closely align with the language content a task is supposed to elicit. Finally, Kuiken, Vedder, and Gilabert (2010) proposed that in order to investigate ability levels in oral performance more fully, it is important to consider the linguistic qualities of production in relation to “getting the message through” (p. 82), or what they and others called communicative adequacy or functional adequacy in an L2.

For the purposes of establishing the validity of the Chinese EIT as a research tool that can gauge oral global proficiency, in the present study, elicited imitation performances were compared to CAF measures extracted from an independent oral narrative task also elicited from the same participants. A set of CAF-related criterion measures was selected that would be relevant to the oral task at hand and that would tap linguistic qualities involved in productive oral use and be indicative of global oral proficiency in Hulstijn's (2011) SLA-oriented definition of core dimensions of BLC. Given that L2 learners of Chinese typically include both HLLs and FLLs with heterogeneous linguistic skills, it was even more important to engage in this comparison of elicited imitation performances against criterion measures as a complementary strategy that would help determine the usefulness of the new EIT as a research tool to measure global oral proficiency in L2 Chinese.

Research Questions

The present study investigated the reliability and validity of the newly developed Chinese EIT. The following quantitative research questions were addressed:

  1. To what extent do Chinese EIT scores distinguish between two institutionally defined high and low language ability groups?
  2. To what extent do Chinese EIT scores reflect differences between HLLs and FLLs within each high and low language ability group?
  3. What is the relationship between participants' performance on the Chinese EIT and their performance on an oral narrative task, evaluated in terms of three CAF-related measures: average numbers of clauses, motion clauses, and motion verb types?
  4. What features of the items might influence the outcomes observed for the EIT and help explain sources of varying item difficulty?



Eighty L2 Chinese learners at a large public university in the United States participated in the study (the same participants as in S.-L. Wu, 2011b). They were sampled from the entire Chinese language program in a stratified random fashion along the two independent variables of interest: institutional status and language background. Regarding the first grouping variable, the institutional status definition of L2 proficiency identified by Thomas (1994, 2006) was used. Specifically, 40 of the learners were sampled from lower-division courses in the Chinese program and constituted the low-level group. This group included learners from different sections of the 200-level Chinese language courses. The other 40 were sampled from upper-division courses and formed the high-level group. This group comprised learners mostly from the 300- and 400-level Chinese language courses and from some graduate courses related to Chinese studies. Based on information gathered from a background questionnaire (see Appendix A), the same 80 participants were further classified into HLLs (n = 40) and FLLs (n = 40). The criteria for classification as an HLL were (1) that learners identified their strongest language before age five as Mandarin Chinese or another Chinese dialect, or (2) that they had one or both parents with Mandarin Chinese or another Chinese dialect as their native or dominant language, and (3) that they also reported exposure to the language at home. This stratified random sampling strategy ensured a balanced study design featuring four groups of equal numbers (n = 20 each).

Procedure and Instruments

Three instruments were developed and administered, as shown in Table 1. The participants first completed the oral narrative task, designed to elicit evidence about their ability to express motion events in L2 Chinese, for which a measure of their global oral proficiency like the EIT was needed in the study (S.-L. Wu, 2011b). They then filled out the background questionnaire, designed to measure their prior contact with Chinese. Finally, they completed the EIT. Time spent to complete the entire study procedure was between 15 and 20 minutes.

Table 1. Procedure

Step  Instrument                             Time for completion
1.    Oral narrative task                    5 to 8 min.
2.    Background information questionnaire   1 to 2 min.
3.    Elicited imitation test                9 to 10 min.

Note. Total time required: 15–20 minutes. See Appendixes A, B, and C for the instruments.

The Chinese EIT

The Chinese EIT (Zhou & Wu, 2009) was a Mandarin Chinese version of the EITs designed by Ortega et al. (2002). Following the original design, it was composed of 30 sentences, each ranging from 7 to 19 characters or syllables (see Appendix B). Given that Mandarin and English differ in their linguistic features, and considering that the 30 sentences needed to pose varying degrees of challenge for L2 Chinese learners at widely different ability levels, Zhou and Wu designed the Mandarin sentences by adhering to the following principles of parallel test translation. First, they sought to retain the original English meaning and structure by adopting Chinese equivalents of the English vocabulary and syntactic structure. Second, when the translated version did not match the syllable length of the original English sentence, they incorporated function words or syntactic patterns that are unique to the Chinese language. (For the equivalence of item length between English and Chinese, it was considered that each character represented one syllable; see Li & Thompson, 1981; Lü, 1981.) The final 30 Mandarin sentences contained a wide range of vocabulary and grammatical structures and followed the original sentence stimuli across the other EIT versions as closely as possible, although they were not identical to them (compare Appendix B).
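The length convention described above (one character counted as one syllable, with items spanning 7 to 19 syllables) is straightforward to check mechanically. The following is only a sketch: it assumes that counting code points in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is an adequate proxy for syllable count, the function names are hypothetical, and the example sentence is made up for illustration rather than taken from the test.

```python
def syllable_length(sentence: str) -> int:
    """Count Chinese characters only, ignoring punctuation and spaces.

    Assumption: each character in U+4E00..U+9FFF is one syllable.
    """
    return sum(1 for ch in sentence if "\u4e00" <= ch <= "\u9fff")

def within_length_spec(sentence: str, lo: int = 7, hi: int = 19) -> bool:
    """True if the sentence falls in the 7-19 syllable range used for EIT items."""
    return lo <= syllable_length(sentence) <= hi

example = "我喜欢学习中文。"  # made-up sentence, not an actual test item
print(syllable_length(example), within_length_spec(example))
```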

During the 10-minute test, participants were instructed to repeat each sentence they heard as exactly as they could. To avoid rote repetition, an interval of 2.5 seconds was inserted between the end of each sentence stimulus and the sound prompting participants to start each repetition. The recorded performances were later evaluated using a five-point scoring rubric (developed by Ortega et al., 2002):

  • 4 = Perfect repetition
  • 3 = Accurate content repetition with some (ungrammatical or grammatical) changes of form
  • 2 = Changes in content or in form that affect meaning
  • 1 = Repetition of half of the stimulus or less
  • 0 = Silence, only one word repeated, or unintelligible repetition

It is worth noting that integrity of the meaning in the repetition was accorded central importance in the scoring rubric, differentiating a score of 3 from a score of 2 or a score of 4. This was in keeping with the original scoring rubric for the other EITs, which had been developed after extensive piloting and careful bottom-up analysis of typical sentence repetition breakdowns across different linguistic abilities in any given target language. It is also consistent with well-established evidence (Sachs, 1967) that, after a sentence is heard, the utterance's meaning can be stored for a significantly longer time than its linguistic form and specific wording, hence a score of 3 for accurate content repetition with some changes of form and of 4 for a perfect repetition that preserved the integrity of the original in both meaning and form. The highest possible individual EIT score was 120, based on 30 items, polytomously scored on a five-point scale from 0 (no repetition) to 4 (perfect repetition).
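The arithmetic behind the maximum score of 120 can be illustrated with a minimal sketch. The function name and the response pattern below are hypothetical and are not taken from the study's materials; only the 30-item length and the 0 to 4 rubric come from the description above.

```python
# Illustrative sketch only: names and data are hypothetical, not from the study.
MAX_ITEM_SCORE = 4   # rubric ceiling (perfect repetition)
N_ITEMS = 30         # number of sentences in the Chinese EIT


def eit_total(item_scores):
    """Sum 30 rubric scores (each an integer from 0 to 4) into a total out of 120."""
    if len(item_scores) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} item scores, got {len(item_scores)}")
    for score in item_scores:
        if score not in range(MAX_ITEM_SCORE + 1):
            raise ValueError(f"invalid rubric score: {score}")
    return sum(item_scores)


# Made-up response pattern for one learner (not real data).
example = [4] * 5 + [3] * 10 + [2] * 8 + [1] * 4 + [0] * 3
print(eit_total(example), "out of", N_ITEMS * MAX_ITEM_SCORE)
```

Validating each item score against the rubric range before summing guards against data-entry errors, which matter in a polytomously scored test where a single mistyped value can shift a total by several points.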

Oral Narrative Task and CAF Coding

The instrument used for the oral narrative task, shown in Appendix C, was a wordless picture story comprising 12 sequential pictures that depicted a story about a boy looking for his missing dog. A total of 12 motion event segments (see Table 2) were included in this story. Students were given two to three minutes to read the 12-picture strip and were instructed to prepare to tell a story in Chinese describing what they saw, providing as many details as they could. They each told the story at a separate computer station and were allowed to look at the pictures while telling the story. Each story was audio recorded and transcribed for analysis.

Table 2. Motion Event Segments in Oral Narrative Task
Motion Event Segments | Examples of Motion Verbs Elicited
1. A boy is walking his dog. | zǒu “walk,” guàng “stroll”
2. The dog follows another dog into a school. | gēn “follow,” pǎo “run,” zhuī “chase; catch,” táo “escape”
3. The two dogs wander into a classroom. | zǒu “walk,” jìn “enter,” pǎo “run,” dào “arrive”
4. The boy goes up the stairs. | shàng “ascend,” “climb,” zhuī “chase; catch”
5. The boy jumps onto a table. | tiào “jump,” shàng “ascend”
6. The boy jumps off the table. | tiào “jump,” xià “descend”
7. The boy runs out of the classroom. | pǎo “run,” chū “exit”
8. The boy goes down the stairs. | pǎo “run,” xià “descend”
9. The boy kicks a ball. | tī “kick,” gǔn “roll”
10. The dog brings the ball back. | sòng “deliver; send,” dài “take; bring; carry,” “take,” lái “come,” huí “return”
11. The boy picks up the dog. | bào “hold in arms,” zhǎo “look for,” “take”
12. The boy and the dog leave. | huí “return,” zǒu “walk”
Note: Pictures are shown in Appendix C.

Three criterion measures gleaned from the oral narrative task were chosen: number of clauses, motion clauses, and motion verbs. Together, they yielded evidence about the fluency, communicative or functional adequacy, and vocabulary capacity deployed in the narratives by the same 80 participants who took the EIT. The learners' performance on these three measures offered an additional way to explore whether performances on the EIT would be consistently related to CAF qualities of L2 oral discourse. If so, this would be an indication that the two sources of evidence shared an underlying oral global proficiency basis, or that for any given participant, the sentence repetition performance and the speech qualities shown in the oral narratives drew from the same underlying global linguistic abilities—in Hulstijn's (2011) terms, the same core components of BLC abilities.

The rationale and operationalization for each measure were as follows. The total number of clauses in the oral narratives was thought to provide a task-relevant indication of learners' speaking fluency and communicative adequacy, given that participants were instructed to describe the picture story in as much detail as they could. For the calculation of this measure, a clause was defined as a unit containing a unified predicate that expresses a single situation (Berman & Slobin, 1994, p. 657). In order to ensure that the analysis reflected communicatively or functionally adequate clauses only, clauses that contained more than one word or phrase that was unintelligible or that showed use of the learner's first language were excluded. Because the pictures eliciting the oral story centered on 12 different motion events (see Table 2 and Appendix C), the number of motion clauses was thought to be an appropriate indicator of communicative effectiveness in terms of degree of task completion, that is, how well the learner could narrate the motion events portrayed in the stimulus story and how relevant the learner made the content of the narrative. Therefore, once clauses had been identified, all clauses that contained information about an entity changing location from one place to another were further identified as motion clauses (Talmy, 2000) and counted. Finally, the number of motion verb types produced in the motion clauses for each narrative was inspected as a task-relevant indication of lexical diversity, or more generally vocabulary capacity. As Zareva, Schwanenflugel, and Nikolova (2005), among many others, suggested, lexical diversity provides a good predictor of learners' general language proficiency.
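To make the three counts concrete, the following is a minimal sketch of how they could be tallied once a narrative has been segmented and coded by hand. The `Clause` structure, its field names, and the toy narrative are assumptions made for illustration; the study's actual coding was carried out manually on transcripts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Clause:
    """One hand-coded clause (hypothetical coding scheme for illustration)."""
    text: str
    excluded: bool = False             # more than one unintelligible or L1 word/phrase
    motion_verb: Optional[str] = None  # set only if the clause encodes a change of location


def caf_measures(clauses):
    """Return the three counts: clauses, motion clauses, motion verb types."""
    kept = [c for c in clauses if not c.excluded]
    motion = [c for c in kept if c.motion_verb is not None]
    verb_types = {c.motion_verb for c in motion}
    return {
        "clauses": len(kept),
        "motion_clauses": len(motion),
        "motion_verb_types": len(verb_types),
    }


# Toy narrative (made up); verbs are drawn from the examples in Table 2.
narrative = [
    Clause("the boy walks his dog", motion_verb="zǒu"),
    Clause("the dog is happy"),
    Clause("the dog runs into the school", motion_verb="pǎo"),
    Clause("the boy runs out", motion_verb="pǎo"),
    Clause("(mostly unintelligible)", excluded=True, motion_verb="tiào"),
]
print(caf_measures(narrative))
```

Excluded clauses are dropped before motion clauses are identified, mirroring the order of operations described above, and verb types are counted as a set so that a repeated verb contributes only once to lexical diversity.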

Reliability of Scores and Coding

To establish coding reliability, 15% of the EIT as well as 15% of the narrative data were independently inspected by a second coder. The scoring and coding results between the two coders were compared, and a simple percent agreement of 95% was found for both the EIT coding and the narrative coding sets. Any disagreements were resolved by discussion, and the rest of the data were rated and coded by the first author.
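Simple percent agreement is the proportion of double-coded units on which the two coders assigned the same value. A minimal sketch follows; the function name and the two score lists are invented for illustration and are not the study's data.

```python
def percent_agreement(coder_a, coder_b):
    """Percentage of units on which two coders assigned identical values."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must rate the same non-empty set of units")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * matches / len(coder_a)


# Made-up rubric scores for 20 double-scored EIT responses (not real data).
first_coder = [4, 3, 3, 2, 1, 0, 4, 4, 2, 3, 1, 2, 3, 4, 0, 1, 2, 3, 4, 2]
second_coder = [4, 3, 3, 2, 1, 0, 4, 4, 2, 3, 1, 2, 3, 4, 0, 1, 2, 2, 4, 2]
print(percent_agreement(first_coder, second_coder))
```

Percent agreement does not correct for chance agreement; a chance-corrected index such as Cohen's kappa would be a stricter alternative, although it is not what the study reports.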


As with the pilot study described earlier (see “A New EIT for L2 Chinese” above), it was important to establish whether the internal consistency of the test was satisfactory, as a way to establish the reliability of the EIT scores from this new and much larger sample. The value of .97 (n = 80) obtained on Cronbach's alpha suggested very high reliability. Next, descriptive statistics for the EIT scores were calculated, broken down into four groups based on two variables: institutionally defined ability levels (high, low) and learning background (heritage, foreign). These are shown in Table 3. As expected, learners classified as high ability level based on their institutional status performed better than learners classified as low ability level and, within each level, HLLs performed better than FLLs. The descriptive results were then submitted to inferential statistical analyses chosen to answer each research question. Below, we report these results in turn, presenting in each case the output for both probability (p) and magnitude (Cohen's d for mean group comparisons and r² for correlations). The probability level was set at p < 0.01. The magnitude values were interpreted as small, medium, and large following the rule-of-thumb benchmarks proposed by Cohen (1988).
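For readers less familiar with the internal-consistency coefficient mentioned above, Cronbach's alpha can be sketched as follows, using the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The function name and the toy score matrix are illustrative assumptions; the study's item-level data are not reproduced here.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a participants-by-items score matrix.

    rows: one list of item scores per participant, all of equal length.
    """
    k = len(rows[0])
    if k < 2 or any(len(r) != k for r in rows):
        raise ValueError("need at least two items and equal-length rows")
    item_columns = list(zip(*rows))                      # transpose to items
    sum_item_var = sum(variance(col) for col in item_columns)
    total_var = variance([sum(r) for r in rows])         # variance of total scores
    return k / (k - 1) * (1 - sum_item_var / total_var)


# Toy data (made up): 5 participants by 4 items scored 0 to 4.
toy = [
    [4, 4, 3, 3],
    [3, 3, 2, 2],
    [2, 2, 2, 1],
    [1, 1, 0, 0],
    [4, 3, 3, 2],
]
print(round(cronbach_alpha(toy), 3))
```

Alpha rises when items covary strongly relative to their individual variances, which is why a test whose items rank learners consistently yields values close to 1.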

Table 3. EIT Scores


Note: The highest possible individual score was 120.

High-Low and HLL-FLL Group Comparison of Performances on the EIT

The first research question sought to determine the extent to which the EIT scores distinguished between the two institutionally defined language ability groups. The EIT scores collected from the 40 learners in upper-division courses were compared with those from the 40 learners in lower-division courses using an independent-samples t test. The learners enrolled in the upper-division courses achieved significantly higher EIT scores (M = 66.5, SD = 23.17) than did the learners enrolled in the lower-division courses (M = 45.55, SD = 20.52), t(78) = 4.28, p < 0.001. This difference was large in magnitude, as indicated by an advantage in favor of the upper-level participants of almost one standard deviation unit (Cohen's d = 0.96). The EIT scores clearly distinguished between the two groups of learners.
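The reported t statistic and effect size can be recovered from the summary statistics alone. The sketch below assumes the pooled-standard-deviation form of Cohen's d and the equal-variance t test; the function names are illustrative, while the means, standard deviations, and group sizes are those given above.

```python
from math import sqrt


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference using the pooled SD."""
    return (m1 - m2) / pooled_sd(sd1, n1, sd2, n2)


def t_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Equal-variance independent-samples t statistic from summary data."""
    sp = pooled_sd(sd1, n1, sd2, n2)
    return (m1 - m2) / (sp * sqrt(1 / n1 + 1 / n2))


# Upper-division vs. lower-division EIT scores as reported in the text.
upper = (66.5, 23.17, 40)
lower = (45.55, 20.52, 40)
d = cohens_d(*upper, *lower)
t = t_from_summary(*upper, *lower)
print(round(d, 2), round(t, 2))  # 0.96 4.28
```

With equal group sizes of 40, t equals d multiplied by the square root of 20, which is why the two reported values are consistent with each other.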

The second research question sought to determine to what extent the Chinese EIT distinguished between HLLs and FLLs. The analysis took into account the possible joint effects or interactions of curricular level (high- vs. low-ability groups) and learner background (HLLs vs. FLLs). The statistical analysis, a two-way ANOVA, was conducted on the disaggregated data of the four equal-sized groups of n = 20 (compare descriptive statistics in Table 3). The results are presented in Table 4.

Table 4. Results of Two-Way ANOVA
Source | SS | df | MS | F | Sig. | Partial Eta Squared
Level × Background | 105.80 | 1 | 105.80 | 0.235 | .629 | .003
*p < 0.01

There was a significant main effect for curricular level as well as for learner background, and no interaction was found between the two. HLLs performed significantly better on the EIT than FLLs, consistently within and across both curricular levels. In this curricular context, where HLLs and FLLs were instructed in the same classes, the HLLs consistently outperformed their FLL counterparts on the EIT, and this was true of both the lower-division participants and the upper-division participants. This significant learner background effect was also large in magnitude, accounting for 20% of the shared variance (see partial eta squared for background in Table 4). More specifically, the HLLs sampled from lower-division courses on average scored higher than the FLLs from the same courses by a mean difference of 14.7 points out of 120, and the HLLs sampled from the upper-division courses outperformed their FLL counterparts by a mean difference of 10.1 out of 120.
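Partial eta squared is defined as SS_effect / (SS_effect + SS_error), which can be rewritten in terms of the F ratio as F × df_effect / (F × df_effect + df_error). The sketch below applies the second form to the interaction term (F = 0.235, df = 1). The error degrees of freedom of 76 are an assumption derived from the design (80 participants minus 4 groups) and are not stated in the text; the function name is likewise illustrative.

```python
def partial_eta_squared(f_ratio, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    if f_ratio < 0 or df_effect <= 0 or df_error <= 0:
        raise ValueError("F must be non-negative and dfs positive")
    return (f_ratio * df_effect) / (f_ratio * df_effect + df_error)


# Interaction term; df_error = 80 - 4 = 76 is assumed from the 2 x 2 design.
DF_ERROR = 80 - 4
interaction = partial_eta_squared(0.235, 1, DF_ERROR)
print(round(interaction, 3))  # 0.003
```

The result matches the value of .003 reported for the Level × Background interaction, which supports the reading that the interaction accounted for a negligible share of the variance.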

Relationship Between Performances on the EIT and the Oral Narrative Task

The third research question investigated the relationship between learners' performance on the EIT and the oral narrative task, the latter evaluated in terms of three CAF measures: number of clauses, motion clauses, and motion verb types. Table 5 shows the Pearson correlation coefficients between the EIT and each of the three measures for the full sample.

Table 5. Correlation Between Performance on EIT and Performance on Oral Narrative Task (n = 80)
Measures | Pearson's r | r² | p
EIT and number of clauses | 0.48 | 0.23 | 0.000*
EIT and number of motion clauses | 0.53 | 0.28 | 0.000*
EIT and number of motion verb types | 0.53 | 0.28 | 0.000*
*p < 0.01

The results show that the shared/explained variance between learners' EIT scores and the number of clauses, motion clauses, and motion verbs produced ranged from 23% to 28%, based on the effect size (r²) values shown in Table 5. Put differently, the higher the EIT scores a given learner in the sample received, the better command he or she tended to exhibit in producing communicatively adequate speech that was delivered smoothly, as measured by the number of clauses; the more capable he or she was of fully describing the different motion events depicted in the picture story, as measured by the number of motion clauses; and the better he or she tended to be at using richer and more diversified vocabulary in the target language, as measured by the number of motion verbs.
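The link between a correlation coefficient and shared variance can be shown with a short sketch: Pearson's r is computed from paired scores, and squaring it gives the proportion of variance the two measures share. The function names and the paired toy scores are hypothetical; only the r values of 0.48 and 0.53 come from Table 5.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def shared_variance(r):
    """Proportion of variance shared by two measures (r squared)."""
    return r * r


# Reported correlations from Table 5 and their shared variance.
print(round(shared_variance(0.48), 2), round(shared_variance(0.53), 2))  # 0.23 0.28

# Made-up paired scores (EIT total, clause count) to show the full computation.
eit_scores = [35, 48, 52, 60, 71, 88]
clause_counts = [9, 14, 11, 17, 16, 22]
print(round(pearson_r(eit_scores, clause_counts), 2))
```

Reporting r² alongside r makes the practical size of the relationship easier to judge, since a correlation of about 0.5 corresponds to roughly a quarter of the variance being shared.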

Given this evidence of a significant relationship between EIT performance and narrative performance, a posthoc question was asked about whether the three CAF-related measures taken from the oral narrative performance distinguished between the two curricular groups. In order to answer this question, upper- and lower-division group performances were compared on the three CAF measures: the number of clauses, number of motion clauses, and number of motion verb types produced in the oral narratives. As shown in Table 6, and consistent with the EIT results, the high-level learners performed significantly better than the low-level learners in terms of all three indexes, with effects that were of either medium size (slightly over half a standard deviation) for mean number of clauses and motion verb types or large size for average number of motion clauses produced, although the magnitude of the differences observed for the CAF measures was generally not as large as the magnitude found for the differences between the low and high groups on the EIT scores.

Table 6. High-Low Group Comparison of the EIT and Oral Narrative Task
Measure | t | p | Cohen's d
Oral narrative task
Number of clauses | 3.03 | 0.003* | 0.68
Number of motion clauses | 3.76 | 0.000* | 0.84
Number of motion verb types | 2.80 | 0.007* | 0.62
*p < 0.01

Features of the Items Influencing Outcomes of the EIT

The final research question addressed the features of the items that influenced the outcomes of the Chinese EIT observed with the present sample. Unless indicated, performances and average scores in this section refer to the entire sample of participants (n = 80).

First, it was observed that there was a significant negative correlation between the mean score for a given item and its length in syllables, r = −0.86, p < 0.001. That is, the longer a sentence was, the harder it was to achieve a high score, and this relationship explained 74% of the variance in the item scores (r² = 0.74). In addition, however, it was also observed that syntactic complexity exerted an impact on the difficulty level. Items that contained a subordinate clause were more difficult than items with a simple sentence structure, even when overall item length was comparable. Consider the following examples:

(1) Item 11
Zuótiān sǐ-le xiǎomāo de xiǎonánhái hěn shāngxīn. (13)
Yesterday die-perfective kitten de little boy very sad
The little boy whose kitten died yesterday is sad. (13)
(2) Item 12
Nà jiā fànguǎn de zhōngguócài yīnggāi hěn búcuò. (13)
that classifier restaurant de Chinese food should very good
That restaurant is supposed to have very good food. (13)

While Item 11 and Item 12 both had 13 syllables, Item 11 contained a subordinate clause and had a mean score of 1.47, whereas Item 12 used a simple sentence structure and had a higher mean score at 2.22, indicating that it was easier to repeat with meaning and form intact (see the scoring rubric in the “Methods” section above). For Item 11, the subordinate clause modifying the subject noun (i.e., xiǎonánhái “little boy”) of the main clause posed a considerable challenge. In Chinese, the modifying subordinate clause precedes the head noun. Many learners were only able to repeat the subordinate clause and had a hard time understanding the semantic relationship between the subordinate clause and the main clause. Indeed, subordination emerged in the EIT performances as a feature that served well to distinguish higher proficiency. Among the 30 items, 9 items were very difficult, with a mean score lower than 1.5: Items 11, 14, 17, 18, 21, 22, 23, 24, and 27. Seven out of these nine items (i.e., Items 11, 17, 18, 22, 23, 24, 27) included a subordinate clause.

Besides length and syntactic complexity, another factor that had an impact on the repetition outcomes was learners' familiarity with the constructions used in the given item. This factor interacted with length in a reverse fashion from the interaction noted above for syntactic complexity. In the EIT, four items (i.e., Items 1, 2, 3, and 7) were the easiest, with a mean score higher than 3.0. While Items 1, 2, and 3 were also the shortest in length, at seven or eight syllables, Item 7 was longer, at 10 syllables, yet among the easiest. Analysis of Item 7 showed that it included the verb-copying construction (i.e., kāi-chē-kāi-de-hěn-hǎo lit. “drive-car-drive-particle de-very-well”), which adds to the length of the item but is a basic Chinese grammatical structure introduced at the beginning level of Chinese curricula and frequently used in the language. Most of the learners seemed to be familiar with this basic Chinese construction and were able to repeat it easily. Thus, it appears from Item 7 that syntactic or grammatical familiarity may be more informative than sentence length in predicting how well learners can repeat the sentences.

The results also suggest that learners' familiarity with the vocabulary played a role. Items 17 and 24 were among the nine items with a mean score lower than 1.5. They each included a relatively low-frequency vocabulary item (Item 17: yōumògǎn “sense of humor”; Item 24: tǒngjì “statistics”), according to the TPS Frequency Dictionary of Mandarin Chinese compiled by Burkhardt (2010). Most learners hesitated or mumbled when trying to repeat these two words.

Finally, it was found that words having little lexical meaning or ambiguous meaning posed difficulty, such as the particle le, denoting change of status or completion of an action, or the adverb cái, denoting lateness of an event/action or conditional relationship between two clauses. These two words were often omitted from low-level learners' repetitions, although such omissions did not cause much change in meaning for the sentence overall. By contrast, the high-level learners demonstrated more advanced skill at retaining the original linguistic forms than did the low-level learners.


The present study explored the degree to which a newly developed Chinese EIT succeeds as a tool that can be used in SLA research to measure global oral proficiency in the L2. Consistent with the results reported on the parallel EITs in the other five L2s (see Ortega et al., 2002; Tracy-Ventura et al., 2013), the new Chinese EIT showed strong internal consistency for the EIT scores produced with the present learner sample. It successfully distinguished between learners enrolled in different sections of the 200-level Chinese language courses in this research context (the 40 low-level group participants) and those enrolled in upper-division courses (the 40 high-level group participants) as well as between HLLs and FLLs within each of the two curricular level groups. Furthermore, higher EIT scores were positively correlated with better performance on the different aspects of the oral narrative task, which tapped fluency, communicative adequacy, and vocabulary capacity as measured by number of clauses, motion clauses, and motion verb types. Finally, qualitative analysis of sources of difficulty on the EIT further confirmed that the instrument included an appropriate range of item difficulty levels spanning from very easy to very difficult. Relative difficulty stemmed from syllable length, syntactic complexity (particularly subordination), familiarity with grammatical structures, vocabulary familiarity, and semantic value of forms included in the sentence stimuli to be repeated.

The high reliability of the Chinese EIT scores can be attributed to three sources that are not mutually exclusive. First, it is an indication that the EIT created useful and consistent variability in the ensuing performances for the present sample. Second, the reasonable number of items in the test (k = 30) may have also contributed to the observed high reliability, because it is well known that, all things being equal, the more items on a test, the more reliable the resulting scores are likely to be. Nevertheless, other EIT instruments that have also proven to elicit reliable performances contained many more items than the present EIT. For example, the EIT developed and investigated by Cox and Davies (2012) contained 60 items, and they reported a Cronbach's alpha of .94 for scores from 179 L2 English learners who took it; Henning's (1983) EIT contained 90 items, and he reported high reliability for data from 143 L2 English learners, indeed superior to the reliabilities obtained with performances from a sentence completion task and an interview. Third, the high reliability of the new Chinese EIT may have also in part been boosted by the five-point scoring, which was nuanced yet straightforward to apply and may have thus contributed to low measurement error and good variability in the distribution of scores (for a discussion of factors that enhance test reliability, see Chapter 7 of Light, Singer, & Willett, 1990).

The findings for the first two research questions were straightforward and supported the expectations that the EIT would discriminate well between the two curricular ability levels and the two types of learner backgrounds within each. Noteworthy is the absence of any statistical interaction between learner background and curricular level on the ANOVA, because it suggests the EIT is useful in tapping subtle differences in linguistic skills between HLLs vs. FLLs, even when sampled from the same ability level posited in the curriculum in this higher education context. This pattern of results supports the more general intuition held by many SLA researchers of L2 Chinese as well as many Chinese language educators that HLLs can exhibit a wide range of L2 proficiencies yet tend to show strengths over their FLL counterparts along the full oral language development continuum.

The findings regarding the CAF indexes are also interesting, not only in that higher EIT scores positively correlated with better performance on the different aspects of the oral narrative task, but also in that the high-level learners performed significantly better than the low-level learners in terms of all three interlanguage indexes. Both results support the validity of the three CAF measures as criterion measures for the present EIT investigation. They also point to the value of measuring the number of clauses, motion clauses, and motion verb types in future SLA research, as proposed by Robinson et al. (2009), who employed CAF measures specifically gauging the expression of motion events. On the other hand, it should be noted that elicited imitation remains more practical than CAF measurement as a research choice. For one, while it took participants only five to eight minutes to complete the oral narrative task, it took between 30 and 40 minutes to transcribe and code each recorded narrative. By contrast, each individual EIT set of 30 repeated sentences took about 10 minutes to complete and was scored in about 10 minutes on average. In addition, while elicited imitation responses can be transcribed prior to scoring, transcription is not necessary, whereas CAF analysis can only be performed on data that have been transcribed first.

In research question 4, we examined sources of varying difficulty in the designed sentence repetition items. We were interested in understanding qualitatively how the EIT works as a useful tool that discriminates performances along the widest possible range of linguistic abilities found in Chinese language programs and in SLA studies. Our findings for this last research question are in agreement with previous studies. Thus, they offer additional support from a non-European and less-commonly-taught language like Mandarin Chinese for the theoretical predictions originally proposed by Slobin and Welsh (1973) as to why elicited imitation works as a shortcut estimate of oral global proficiency, when proficiency is defined in narrow SLA terms as core dimensions of BLC (Hulstijn, 2011).

For one, the statistically significant and large negative correlation we found between syllable length and repetition scores (r = −0.86, r² = 0.74), indicating that learners at higher levels of language ability were more capable of repeating longer sentences than learners at lower levels, has also been similarly reported by Weitze and Lonsdale (2011). This relationship between item length and repetition difficulty is theoretically motivated (Slobin & Welsh, 1973); exact repetition of a meaningful utterance, especially when the length of the sentence is beyond the capacity of short-term memory, requires learners to be able to promptly comprehend the sentence after listening to it (see Okura & Lonsdale, 2012). As longer items generally include richer semantic content, it becomes more challenging for learners to process longer sentences, unless their global proficiency level is high enough to accommodate the grammatical parsing of more challenging lengths. The two related observations that syntactic complexity exerted an impact on the difficulty level and that length was sometimes outweighed by syntactic complexity have also been reported by Tracy-Ventura et al. (2013) for the parallel French EIT. Thus, the degree of syntactic complexity, and particularly subordination, plays a crucial role in the level of difficulty in the Chinese as in the French EITs. That learners' familiarity with the vocabulary plays a role also resonates with findings reported by Graham et al. (2010). It seems that having both a large vocabulary size and a high degree of mastery of vocabulary in both receptive and productive modalities may be essential in order to achieve a high score in our Chinese EIT, as much as in Graham et al.'s English EIT.

We also observed that high-level learners demonstrated more advanced skill at retaining the original linguistic forms in the EIT stimuli than did the low-level learners; this was particularly noticeable in the Chinese performances with words that have low or ambiguous semantic value (e.g., the particle le or the adverb cái), which can be omitted at no expense to the overall meaning. This makes good theoretical sense if one remembers that, in addition to semantic processing, an exact repetition requires learners to attend to linguistic forms and grammatical structures as well. Yet even in a first language, a well-known fact established by Sachs (1967) is that after a sentence is heard the linguistic form and specific wording of the utterance are more easily lost and forgotten than its meaning, which can be stored for a significantly longer time. The same fact is of even greater relevance in a second language because, as VanPatten (2004) has long noted, learners process input for meaning before they process it for form. The higher the proficiency level, the more cognitive resources can gradually be allocated to the processing of forms. Consequently, only when learners are familiar with the majority of the lexical items in the input can they allocate attentional resources to process the form of an utterance, and particularly specific individual forms with low semantic value within it. The processing skills required for retaining the form of an utterance thus demand higher linguistic proficiency, such that learners can engage in semantic processing without undue constraints and remain capable of devoting resources to process the form as well.

Given the encouraging results reported in this study, it is worthwhile to reflect on the added value of the EIT, particularly in terms of its practical usability for SLA research purposes. The Chinese EIT takes 10 minutes to complete and can be scored by human raters also in about 10 to 15 minutes. By comparison, the existing traditional proficiency tests that L2 Chinese researchers may be able to use, such as the HSK or the AP Chinese test, typically take between two and three hours to complete and require a qualified tester. The OPI and the OPIc for Mandarin Chinese take up to 30 minutes and are thus less demanding of time, but they are costly for researchers because of the fees associated with individual test administration, and/or the need for certified interviewers and/or raters. Furthermore, a full-blown standardized test or an oral narrative task may trigger more communicative stress than a sentence repetition task, and this consideration may make the EIT particularly well suited for use in studies where beginning and intermediate levels of language ability are of interest. Particularly with oral narratives, which are often employed in SLA studies, low-proficiency participants may experience anxiety because of the communicative demands and quickly reach the point of breakdown. If so, it may be difficult to elicit sufficient data for meaningful CAF analyses from L2 Chinese learners who have just recently begun studying the language. By comparison, even at beginning levels of instruction, most learners participating in an SLA study may be less intimidated by the request to repeat as much as they can of each Chinese sentence they hear.

Limitations and Conclusion

The results gleaned in the present study offer important evidence that the new Chinese EIT is an effective research tool in gauging global oral proficiency in L2 Chinese at the level of precision that is needed for measurement of what Hulstijn (2011) has defined as core components of BLC. Based on the present evidence, the new Chinese EIT can be recommended to SLA researchers in need of a practical, reliable, and valid shortcut instrument for measuring proficiency. It can be expected to work well along the potential full range of basic linguistic competencies that typifies L2 Chinese learner populations in many higher education contexts, including competencies ranging from beginning to very advanced ones and including learners from non-heritage and heritage language backgrounds.

However, the new Chinese EIT, or any EIT in general, may not meet the proficiency measurement needs of SLA researchers for all research purposes. What constitutes the best proficiency measurement choice for SLA research will vary depending on the research questions each researcher sets out to investigate and the research purposes for which proficiency as a variable is needed within each study. When different aspects of communicative competence are the research focus of a study, elicited imitation will not be the best technique, as it does not tap dimensions of communication such as sociolinguistic, discourse, and pragmatic ability. Nor could an EIT tap learners' ability to use the target language in interaction or for authentic speaking and listening. When measurement of functional literacy—that is, facility with the written language—is important for a given SLA study, the choice of elicited imitation would be misguided and other choices such as a cloze test, a c-test, or the ACTFL Writing Proficiency Test and the ACTFL Reading Proficiency Test would be in order. On the other hand, when it is important in a study to measure core linguistic proficiency that is independent from literacy skills, EIT may be an ideal choice, precisely because it measures global linguistic proficiency independently from and without presupposing or relying on knowledge of reading and writing.

While the present study did not address the use of the new Chinese EIT as a tool that might meet the assessment needs of L2 Chinese programs, this possibility looks promising. Application of elicited imitation for English has been found to be an effective assessment tool for placement purposes (see Cox & Davies, 2012; Okura & Lonsdale, 2012). Future research is called for that can establish the usefulness and validity of adopting the Chinese EIT as a language course placement test, whether independently or as a supplement to other tests already used in Chinese programs. An effective tool like the Chinese EIT, which can yield a short-cut estimate of global linguistic proficiency, would be in strong demand among L2 Chinese programs for several reasons. First, most placement tests currently adopted by Chinese programs only tap reading and writing skills, so the EIT would provide additional insight into students' oral skills and could result in improved placement decisions for some learner populations and contexts. Even more important, perhaps, the new Chinese EIT both reflected differences between the HLLs and FLLs in the present sample and captured the reality that HLLs in this particular research context tended to have better oral linguistic skills than their FLL counterparts within the same classes. This pattern of results supports the more general intuition held by many Chinese language educators that HLLs can exhibit a wide range of L2 proficiencies and therefore may be placed into any level of a given language program, yet they tend to show proficiency strengths over their FLL counterparts within the same level (e.g., S.-M. Wu, 2008). Future research that directly compares the EIT with other placement options should be undertaken to determine the extent to which the present test more accurately places into appropriate courses and instructional levels students who bring to the instructional setting very diverse learning profiles and abilities (Ke & Li, 2011).

To conclude, the data presented in this study appear to offer convincing empirical evidence that the new Chinese EIT can serve as a useful proficiency measurement tool in investigations of L2 Chinese for a variety of research needs. When a researcher's overall purpose is to produce an informative estimate of participants' relative global linguistic skills or abilities drawing from core components of BLC, the EIT makes efficient use of both participants' and researchers' time and resources while yielding useful and reliable information. The fact that parallel versions of the EIT exist not only for Chinese but also for five other target languages makes it a promising tool for SLA researchers interested in future cross-linguistic research.


Background Questionnaire

Note: Some of the questions in this background information questionnaire were extracted and revised from a language background questionnaire designed by Kimi Kondo-Brown for departmental use. The questionnaire was revised and used in the present study with the permission of the original author.

  • Q1. What was your first or strongest language before age 5?
    □ English □ Mandarin Chinese □ Chinese dialect (specify) _____ □ Other (specify) ______
  • Q2. What is your strongest language now?
    □ English □ Mandarin Chinese □ Chinese dialect (specify) _____ □ Other (specify) ______
  • Q3. Check if your parents, grandparents, or anyone else in your immediate/extended family is a native speaker of Mandarin Chinese or a Chinese dialect.
    □ Mother □ Father □ Maternal grandparent(s) □ Paternal grandparent(s) □ Other (specify) __
  • Q4. At what age did you start to hear or use Mandarin Chinese? ______________
  • Q5. Mandarin learning inside classroom
    How long (in years) in total have you studied Mandarin at school? ____________________
    List the following information for any previous Mandarin studies (e.g., college, high school, intermediate/elementary school, Chinese language school, private language institute, private tutor, etc.). Please also include the current study program.
    School 1: ________________________ (school name) in _____________ (country name)
    Start year: ______ End year: ______ Hours of Mandarin class per week _________
    School 2: ________________________ (school name) in _____________ (country name)
    Start year: ______ End year: ______ Hours of Mandarin class per week _________
    School 3: ________________________ (school name) in _____________ (country name)
    Start year: ______ End year: ______ Hours of Mandarin class per week _________
  • Q6. Have you visited/lived in a Chinese-speaking country? □ No □ Yes (if yes, see below)
    (At what age: _________; For _________ [length of the stay]; Location: ________________)
    (At what age: _________; For _________ [length of the stay]; Location: ________________)
  • Q7. How much do you hear or use Chinese outside the classroom?
    1: never 2: occasionally 3: sometimes 4: frequently 5: almost always
    - parents/grandparents speaking Chinese to you  N/A  1  2  3  4  5
    - relatives/friends speaking Chinese to you  N/A  1  2  3  4  5
    - self-study Chinese  N/A  1  2  3  4  5
    - others (specify): _______________________  N/A  1  2  3  4  5


Mandarin EIT

Note: The Mandarin EIT was developed by Zhou and Wu (2009), based on the English EIT developed by Ortega et al. (2002). Items were revised where necessary to adjust their length in syllables or to reflect features of the Mandarin language more naturally in translation. Numbers in parentheses represent the total number of syllables included in each item.

1. [Chinese sentence not reproduced] (7)
 I have to get a haircut. (7)
2. [Chinese sentence not reproduced] (8)
 The red book is on the table. (8)
3. [Chinese sentence not reproduced] (8)
 The streets in this city are wide. (8)
4. [Chinese sentence not reproduced] (9)
 He takes a shower every morning. (9)
5. [Chinese sentence not reproduced] (10)
 It is possible that it will rain tomorrow. (12)
6. [Chinese sentence not reproduced] (11)
 What did you say you were doing today? (10)
7. [Chinese sentence not reproduced] (11)
 I doubt that he knows how to drive that well. (10)
8. [Chinese sentence not reproduced] (12)
 After dinner I had a long, peaceful nap. (11)
9. [Chinese sentence not reproduced] (12)
 I enjoy movies that have a happy ending. (12)
10. [Chinese sentence not reproduced] (12)
 The houses are very nice but too expensive. (12)
11. [Chinese sentence not reproduced] (13)
 The little boy whose kitten died yesterday is sad. (13)
12. [Chinese sentence not reproduced] (13)
 That restaurant is supposed to have very good food. (13)
13. [Chinese sentence not reproduced] (14)
 You really enjoy listening to country music, don't you? (14)
14. [Chinese sentence not reproduced] (14)
 She just finished painting the inside of her apartment. (14)
15. [Chinese sentence not reproduced] (15)
 Cross the street at the light and then just continue straight ahead. (15)
16. [Chinese sentence not reproduced] (15)
 I wish the price of town houses would become affordable. (15)
17. [Chinese sentence not reproduced] (15)
 The person I'm dating has a wonderful sense of humor. (15)
18. [Chinese sentence not reproduced] (16)
 I want a nice, big house in which my animals can live. (14)
19. [Chinese sentence not reproduced] (16)
 I hope it will get warmer sooner this year than it did last year. (16)
20. [Chinese sentence not reproduced] (16)
 A good friend of mine always takes care of my neighbor's three children. (16)
21. [Chinese sentence not reproduced] (16)
 Before he can go outside, he has to finish cleaning his room. (16)
22. [Chinese sentence not reproduced] (16)
 The most fun I've ever had was when we went to the opera. (16)
23. [Chinese sentence not reproduced] (16)
 The terrible thief whom the police caught was very tall and thin. (17)
24. [Chinese sentence not reproduced] (16)
 The number of people who smoke cigars is increasing every year. (17/18)
25. [Chinese sentence not reproduced] (16)
 The exam wasn't nearly as difficult as you told me it would be. (18)
26. [Chinese sentence not reproduced] (17)
 She only orders meat dishes and never eats vegetables. (15/16)
27. [Chinese sentence not reproduced] (17)
 The black cat that you fed yesterday was the one chased by the dog. (16)
28. [Chinese sentence not reproduced] (17)
 Would you be so kind as to hand me the book that is on the table? (17)
29. [Chinese sentence not reproduced] (18)
 I don't know if the 11:30 train has left the station yet. (18)
30. [Chinese sentence not reproduced] (19)
 There are a lot of people who don't eat anything at all in the morning. (19)


Oral Narrative Task

Note: This picture story was developed by the first author (S.-L. Wu, 2011b), and the pictures were drawn by the artist Pei-Hua Wu (©2011).

Instructions: You will see 12 sequential pictures in this task. Your job is to tell a story in Chinese to describe what you see in as much detail as you can.



  • Shu-Ling Wu (PhD, University of Hawai'i at Mānoa) is an Assistant Professor of Chinese at the Defense Language Institute-Hawai'i Learning Center, Mililani, HI.

  • Lourdes Ortega (PhD, University of Hawai'i at Mānoa) is Professor of Linguistics, Georgetown University, Washington, DC.