Keywords:

  • effect size;
  • interaction hypothesis;
  • quantitative research methods;
  • reporting practices;
  • study quality;
  • systematic review

Abstract

This article constitutes the first empirical assessment of methodological quality in second language acquisition (SLA). We surveyed a corpus of 174 studies (N = 7,951) within the tradition of research on second-language interaction, one of the longest and most influential traditions of inquiry in SLA. Each report was coded for methodological features, statistical analyses, and reporting practices associated with research quality, and the resulting data were examined both cumulatively and over time. The findings indicate not only strengths and weaknesses but a possible relationship between study quality and outcomes; improvements over time and methodological trends are also noted. In addition to providing direction for future research and research practices, the study's findings are discussed and contextualized within the research culture of SLA.

Progress in any of the social sciences depends on sound research methods, principled data analysis, and transparent reporting practices; the field of second language acquisition (SLA) is no exception. Within SLA, there are numerous books (e.g., Dörnyei, 2007; Hatch & Lazaraton, 1991; Mackey & Gass, 2005, in press) devoted to enhancing the field's research methods (see also Loewen & Gass, 2009, for a chronicling of statistical rigor in SLA). In addition, the peer-review process in SLA maintains rigorous and sophisticated control over published research. However, whether (or to what extent) quantitative studies in SLA have been carried out in adherence to standards of methodological quality is an empirical question, and one that this study seeks to answer. To accomplish this goal, we take one area of research—interaction-based research—as our subject of investigation. Interaction research was selected primarily due to the fact that there is perhaps no topic in the field of SLA that matches the volume, longevity, and impact of this research area.

Assessing Methodological Study Quality in Primary Research

Although scarcely mentioned in SLA, a robust line of research exists among other social sciences around the construct of study quality. Work in this area stems mainly from the meta-analysis literature, in which over 300 instruments have been designed to assess the quality of quantitative empirical research (see Wells & Littell, 2009). These measures have generally been used to weight effect sizes from primary studies based on the quality of their design, methods, and reporting practices. Systems for scoring and weighting primary research range from a simple exclusion/inclusion (essentially a weight of 0 or 1) to much more sophisticated procedures that attempt to approximate ultimate levels of psychometric precision (see Borenstein, Hedges, Higgins, & Rothstein, 2009, chapter 38). Regardless of the complexity of the approach, however, the assumption underlying instruments of this type is that studies of higher methodological quality should contribute more to the meta-analytic average than those of lower quality. Although it is not the intention of this study to assign a weighted value to the studies investigated, the criteria included in previous instruments of this type (e.g., sample size, random assignment to experimental conditions) constitute an important point of departure for a descriptive measure of research practices (for reviews of existing instruments, see Valentine & Cooper, 2008; Wells & Littell, 2009).
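The weighting logic described above can be illustrated with a minimal sketch. The effect sizes, quality scores, and the 0.5 inclusion threshold below are hypothetical and are not drawn from any instrument cited here; the sketch also ignores the inverse-variance weights that meta-analyses typically apply alongside any quality weight.

```python
def weighted_mean_effect(effects, weights):
    """Return the weighted average of a list of effect sizes (e.g., d values)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one study must have a nonzero weight")
    return sum(d * w for d, w in zip(effects, weights)) / total


# Hypothetical d values and quality scores (0 to 1) for four primary studies.
effects = [0.80, 0.50, 0.30, 1.20]
quality = [1.0, 0.5, 0.0, 0.5]

# Simple inclusion/exclusion: a study counts fully (1) or not at all (0).
# The 0.5 threshold is an assumption made for this illustration.
include = [1 if q >= 0.5 else 0 for q in quality]

binary_mean = weighted_mean_effect(effects, include)   # unweighted mean of included studies
graded_mean = weighted_mean_effect(effects, quality)   # higher-quality studies count more
```

Under either scheme, the third study (quality score of 0) contributes nothing to the average; the graded scheme additionally lets the first study count twice as much as the second and fourth.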

Beyond existing measures of methodological quality, there are several additional approaches worth considering in evaluating the quality of quantitative research in SLA, generally speaking, and, in particular, the interactionist tradition, the topic of the current study. One useful source is the Publication Manual of the American Psychological Association (APA). Not only has SLA traditionally built upon the methodological groundwork laid by psychology (Felser, 2005; Gass, 1993), but several SLA journals, including Language Learning, adhere to APA style. With such an overtly expressed connection, any measure of methodological rigor for SLA should consider the requirements of the APA as expressed in the Publication Manual as well as other, more specialized APA guidelines for publication (e.g., Journal Article Reporting Standards Working Group, 2008; Wilkinson & Task Force on Statistical Inference, 1999). Items in an assessment of research practices drawn from these sources might include reporting that assumptions of statistical tests have been checked or reporting (a priori) Type I error rate, standard deviations to accompany means (or other estimates of precision), and confidence intervals.

In addition to adherence to APA recommendations, some SLA journals provide guidelines that serve as a regulatory force to ensure methodological quality and a modicum of consistency among empirical studies. Chapelle and Duff (2003), for example, provide detailed requirements for manuscripts submitted for publication to TESOL Quarterly. Among other features, authors are instructed to report instrument reliability, power, and the exact p-value and effect size resulting from all statistical tests (see also DeKeyser & Schoonen, 2007; N. C. Ellis, 2000). Most journals in SLA are less explicit, relying on the peer-review process, an essential component of scientific progress (see Burnham, 1990; Hopewell, Clarke, & Mallet, 2005; Jefferson, Alderson, Davidoff, & Wager, 2003), to ensure that only those studies of the highest quality are published (Loewen & Gass, 2009). That is not to say that the editors of those journals have been unconcerned with the importance of rigorous methods and reporting practices. In fact, early in her editorship at The Modern Language Journal, Magnan (1994) expanded the review process of the journal to include a “specific review for appropriateness of research design, methods, and statistical procedures” (p. 8) for all empirical submissions that advance beyond the first round of reviews.

An evaluation of quantitative research practices can and should also be informed by cumulative and historical perspectives presented in previous reviews. Several reports have described second-language (L2) research methods and reporting practices generally, but without the explicit intention to assess their quality (e.g., Edge & Richards, 1998; Henning, 1986; Lazaraton, 2000). Chaudron's (2001) review of nine decades of classroom-based research in The Modern Language Journal includes a thoughtful and at times critical discussion of methodological problems and shortcomings. He lamented, for example, measures of low reliability, generally poor design, and the fact that “intact groups are the norm” (pp. 66–67; see also Norris & Ortega, 2003; Read, 2007). Similar commentaries have also surfaced in several meta-analyses. Norris and Ortega (2000) made pointed suggestions for improving L2 research methods, including more complete reporting of data (e.g., confidence intervals, effect sizes, minimal information to calculate an effect size) and greater use of pretesting in quasi-experimental and experimental studies. Oswald and Plonsky (2010) made similar suggestions, commenting on and citing the number of studies six L2 meta-analyses excluded due to incomplete reporting of data, ranging from 16 in Russell and Spada (2006) to 35 in Plonsky (in press). Several meta-analyses have also hypothesized about and examined empirically the relationship between methodological quality and study outcomes. Russell and Spada (2006) calculated the average effect of error correction based on whether individual studies reported reliability and validity of outcome measures. Likewise, Plonsky (in press) formed subgroups of studies of L2 strategy instruction based on three aspects of methodological quality, finding substantially larger effects for studies that (a) pretested (d = 0.54) versus did not pretest (d = 0.39), (b) employed random group assignment (d = 0.65 vs. d = 0.42), and (c) reported reliability (d = 0.65 vs. d = 0.42). These findings along with other suggestions for reform (e.g., Byrnes, 2008) point not only to the presence of weaknesses in SLA research but also to a possible relationship between different methodological/reporting practices and study outcomes (see Prentice & Miller, 1992; see also Lipsey & Wilson, 1993, for a synthesis of meta-analyses from psychology and education that compares effect sizes from primary studies according to methodological quality). Finally, reviews of individual applied linguistics journals have also found a lack of perceived importance of rigorous methods (e.g., Egbert, 2007; Magnan, 2007; Smith & Lafford, 2009). For example, only 2 of the 31 applied linguists surveyed by Egbert (2007) cited sound research design in articles as a factor in determining the value of a particular TESOL-related journal, and none of Smith and Lafford's (2009) participants cited design quality in their evaluation of CALL journals.

In order to more clearly illustrate and understand the quality of SLA research practices, it will be useful to consider different study features not only cumulatively but over time. To begin this process, we consider a subarea of SLA—interaction-based research (see Plonsky, forthcoming, for an analysis of SLA, more generally). As will be discussed below, the methods in interactionist research have evolved significantly since the 1980s. Therefore, it is not unrealistic to expect certain methodological improvements to have entered this domain over time. Russell and Spada (2006), for example, found an improvement in reporting practices in “the evolution of studies on CF [corrective feedback] over the past 15 years” in that “most of the recently published studies met the criteria for inclusion in a meta-analysis” (p. 156).

Like methods, evidence is never static (see Trikalinos et al., 2004). Additionally, improvements to methods and instrumentation may lead to larger effect sizes over time (Fern & Monroe, 1996; Oswald & Plonsky, 2010). In contrast, we can imagine an alternate scenario playing out in a body of literature. Early research in a given area is often characterized by strong manipulations that set out to determine whether an effect exists and thereby determine whether the claims of a particular and usually novel hypothesis merit further attention. Such experiments would tend to yield large effect sizes (Kline, 2004). Subsequently, after an effect is found (e.g., L2 gains resulting from interaction), research efforts may shift to the generalizability of an effect across samples, settings, and tasks as we have seen in the interactionist literature. In areas where this scenario is observed, theoretical maturity would be inversely correlated with outcomes and thus a decrease in effect sizes would be obtained over time (Plonsky & Oswald, 2010, in press).

We turn next to a review of the interactionist tradition in SLA, the particular subarea of SLA used to investigate study quality. Although the methods from several subsets of the interactionist literature have been described in part (e.g., Mackey & Goo, 2007) and criticized (Leow, 2000; Simard & Wong, 2001), to date there has been no comprehensive assessment of research practices or study quality in the interactionist tradition. In fact, to our knowledge, no subdomain within SLA has been subject to a comprehensive review of this nature. The principal aim of this article is, therefore, to examine the methods and evaluate the methodological quality of research on L2 interaction. To that end, we have taken a longitudinal perspective, surveying a representative corpus of quantitative interactionist studies published over the past 30 years.

A Brief History of Theory and Research on L2 Interaction

Influenced in its early years by Krashen's input hypothesis (Krashen, 1977, 1982, 1985) and Swain's output hypothesis (1985), L2 interaction research holds that negotiations resulting from the communicative pressure that is present during L2 interaction promote comprehension as well as linguistic development. This simple yet foundational assertion, expressed formally in Long's (1983a, 1996) interaction hypothesis, has led to numerous books and edited volumes (e.g., R. Ellis, 1999; Gass, 1997; Gass & Madden, 1985; Mackey, 2007a; Mackey & Polio, 2009), several meta-analyses (e.g., Keck, Iberri-Shea, Tracy-Ventura, & Wa-Mbaleka, 2006; Mackey & Goo, 2007; Russell & Spada, 2006), and hundreds of empirical and review articles.

Since its inception in the early 1980s (see Gass & Mackey, 2006, 2007; Mackey & Gass, 2006), the interactionist tradition in SLA has maintained a dynamic research agenda by regularly introducing and testing increasingly nuanced models of interaction and its effects on L2 development (Mackey, 2007b). As theory in this area developed and new questions were posed, the range of research methods and statistical procedures employed to address and answer those questions has also expanded (Mackey & Gass, 2006). For example, the largely observational designs where interaction was the outcome variable sought to quantify: (a) features (e.g., clarification requests, repetitions; cf. Gass & Varonis, 1985a, 1985b; Long, 1980; Sato, 1986) of L2 interaction; (b) learner variables (e.g., gender [Gass & Varonis, 1985a; Pica, Holliday, Lewis, & Morgenthaler, 1989], metalinguistic awareness [Hirvonen, 1985], age [Scarcella & Higa, 1981], L2 proficiency [Kleifgen, 1985]); and (c) environmental and task-related variables (e.g., task type [Brown, 1991; R. Ellis, 1985], the presence of native speakers [Derwing, 1989; Hirvonen, 1985; Long, 1983b] and/or teachers [Pica, 1987; Pica & Doughty, 1985a, 1985b], the number of interlocutors [Doughty & Pica, 1986], and task essentialness [Doughty & Pica, 1986; Hawkins, 1985; Loschky & Bley-Vroman, 1993]). This research gave way to experimental and quasi-experimental studies designed to measure L2 gains resulting from interaction (i.e., interaction as a treatment or independent variable). Likewise, as additional variables (e.g., attention, corrective feedback) were introduced, novel research designs and instruments were developed and tested with respect to their influence on the processes and outcomes of L2 interaction.

Given the predominantly descriptive nature of early research on interaction, most quantitative studies employed observational or ex post facto designs and took place in laboratory contexts. In fact, although a variety of new variables was introduced, there was a great deal of methodological homogeneity. One work that stands out, however, is Hawkins (1985). In her study of the comprehensibility of foreigner-talk, Hawkins provided the first example in the interactionist literature of what would later be referred to as stimulated recall, a technique that has since been used widely in this and other areas of SLA (see Gass & Mackey, 2000).

In the early 1990s, researchers moved beyond a mere description of the features of interaction and began to examine their effect on acquisition. Methodologically speaking, this second major wave of research made several significant contributions. This period was marked, for example, by an increase in classroom-based (as opposed to lab-based) studies—an indication of a domain's theoretical maturity according to Oswald and Plonsky (2010)—and experiments involving learners of languages other than English as a second language. Furthermore, as the role of interaction shifted from outcome/dependent variable to treatment/independent variable, more researchers began to include pretests and posttests (e.g., Fotos & Ellis, 1991) as well as comparison groups (e.g., Tomasello & Herron, 1989). Most studies in this line of (quasi-)experimental research on the relationship between interaction and L2 development have analyzed their data at the group level, using analyses of variance (ANOVA) and other similar statistical tests to compare means between or within groups. Two recent meta-analyses—Keck et al. (2006) and Mackey and Goo (2007)—have synthesized this body of research, each taking a slightly different scope but arriving at the same finding of a generally medium-to-large effect for interaction.

Although the overall effects of interaction on acquisition appear to be both statistically and practically significant, it is important to recognize the range of research designs and practices found across the many individual studies designed to test different facets of that relationship. In the early 1990s, theorists began to question not only the developmental effects of interaction but the longevity of those effects as well. As a consequence, delayed posttests were introduced into the designs of studies (e.g., Carroll & Swain, 1993; Fotos, 1993; Fotos & Ellis, 1991; Mackey, 1999). Theoretical interest in other variables hypothesized to relate to interaction has also inspired the development of novel techniques and approaches. Noticing, for example, a construct that has proven slippery in the hands of SLA researchers (see, e.g., Godfroid, Housen, & Boers, 2010; Schmidt, 2001), has prompted the design of several unique operationalizations for both online (underlining, accurate immediate recall of recasts; e.g., Fotos, 1993; Philp, 2003; uptake sheets, Mackey, 2006) and offline (stimulated recall, journals, questionnaires, and introspective interviews following interaction; e.g., Lai & Zhao, 2006; Mackey, 2006) data collection techniques.

Not all interactionist research originates exclusively in the realm of theory. More practical concerns have also entered the sights, and the designs, of researchers. One example is the potential benefit for learners who interact via computer-mediated communication (e.g., Blake, 2000; Kelm, 1992). Another area of interest to both theory and practice is corrective feedback (CF). Early studies of CF as a feature of interaction were interested in describing patterns of native speaker/nonnative speaker conversation (e.g., Day, Chenoweth, Chun, & Luppescu, 1984). By the mid- to late-1990s, interest in CF had expanded to occupy a central role not only within the theoretical and empirical work of the interactionist tradition but also within the field of SLA more generally. The immense volume of research on this topic is reflected in the six meta-analyses that have synthesized quantitative findings from studies of interactional CF (Keck et al., 2006; Li, 2010; Lyster & Saito, 2010; Mackey & Goo, 2007; Norris & Ortega, 2000; Russell & Spada, 2006). Additionally, despite what may appear to be an exhaustive line of research, studies of CF continue to occupy a significant portion of journal pages and conference programs.

CF is one of the longest-standing interests within the interactionist tradition, and different models of CF have been proposed, tested, and refined extensively, further illustrating the theoretical maturity of the domain. For example, early CF studies were motivated by an interest in whether CF was present or absent from an exchange between interlocutors (e.g., Tomasello & Herron, 1989), whereas recent studies look closely at the nature, effects, and specific types of CF (Loewen & Philp, 2006; Lyster & Mori, 2006; Sheen, 2006) and various types of form-focused episodes and language-related episodes (e.g., Williams, 1999).

This discussion of methods would be remiss to ignore the use of statistical techniques and analyses in interactionist research. As mentioned earlier, the research questions of early studies were generally addressed using frequency counts (e.g., Gaies, 1981), percentages (e.g., Pica, 1987), and occasionally chi-squares (e.g., Brown, 1991). Quantitative studies examining the link between interaction and acquisition, on the other hand, more often required the use of inferential statistics such as t tests (e.g., Nagata, 1997), ANOVAs (e.g., Mackey & Philp, 1998), other means-based tests (e.g., multivariate analysis of variance; Muranoi, 2000), and the nonparametric counterparts of these tests such as the Mann-Whitney U-test (Lin & Hedgcock, 1996) and the Wilcoxon signed ranks test (Kim & McDonough, 2008). The objectives of still other studies (Sheen, 2007; Takimoto, 2006) required a factor analytic approach to the data.

This brief review has shown that the interactionist tradition has not only been prolific over the last 30 years, producing an enormous body of scholarly literature, but it has also shown many signs of both theoretical and methodological maturity (Mackey, 2007b). Considering as well its highly visible status in the field, interaction makes for an appropriate, if not ideal, domain to begin investigating the methodological quality of research conducted in SLA.

Research Questions

Considering the major issues discussed thus far—namely, (a) the need to take stock of research practices in our field, (b) the centrality and longevity of the interactionist tradition in SLA, and (c) recent concerns expressed about the rigor of empirical efforts in SLA (e.g., Norris & Ortega, 2000; Oswald & Plonsky, 2010)—this study seeks to answer the following research questions:

  • RQ1. To what extent have quantitative studies of interactionist research employed various study designs and statistical procedures?

  • RQ2. To what extent have quantitative data in interactionist research been reported thoroughly?

  • RQ3. Is there a relationship between research and reporting practices as measured through our analysis in research questions 1 and 2 and study outcomes (i.e., effect sizes) in interactionist research?

  • RQ4. How have different aspects of study quality including designs (as addressed in RQ1), statistics (as addressed in RQ1), reporting practices (as addressed in RQ2), and outcomes (as addressed in RQ3) in interactionist research changed over time?

Method

The above four research questions were addressed by surveying a representative collection of quantitative studies carried out in the interactionist tradition. Although not a meta-analysis per se, this study utilizes several meta-analytic techniques to retrieve, code, and analyze the body of primary research being investigated.

Study Retrieval

The first step was to define the domain. This was done in two ways: (a) by scope and (b) by location of publication (see below). For the purposes of this study, the interactionist tradition was defined to include any study concerned with the features and/or effects of interaction involving L2 learners. Given this study's interest in statistical procedures and data reporting practices, only quantitative studies were included in the sample. Additionally, although L2 interaction has been investigated within other research traditions such as conversation analysis (see Hall, 2010, and Gass, 2004, for a discussion of differences and commonalities between conversation analytic and interactionist approaches to L2 interaction) and sociocultural theory (e.g., Ohta, 2000), the focus of this article is on those quantitative studies falling into the line of research both leading up to and prompted by the interaction hypothesis (Long, 1983a, 1996).

The choice of where to locate studies in the tradition of interactionist research began with the assumption that SLA is a “journal culture” (VanPatten & Williams, 2002, p. 10) as opposed to a book culture (see Smith & Lafford, 2009). Therefore, we decided to limit the search to the 15 journals that regularly publish SLA articles as listed in VanPatten and Williams (2002). Language Learning & Technology was then added to the list of titles to increase the study's inclusiveness, and Journal of Second Language Writing and Language Awareness were removed because the scope of those journals was considered to be outside the focus of this study. Chapters from two edited volumes, the earliest book of empirical studies of L2 interaction (Gass & Madden, 1985) and the most recent (Mackey, 2007a), were also considered and included as “bookends” to the domain.¹ Once the list of sources was compiled, every issue of every title was manually searched to locate relevant studies published between 1980 and summer 2009. The start date of 1980 was based on two considerations. First, numerous articles (Gass & Lewis, 2007; Gass & Mackey, 2006, 2007; Mackey & Gass, 2006; Polio, Gass, & Chapin, 2006) have cited the early 1980s as the beginning of interactionist research. Second, Long (1980) completed his dissertation on input and interaction in 1980, which is frequently cited as a seminal work in this area. Table 1 lists each source and the years searched, which vary according to when each source was first published. As a preliminary means of determining the dataset, a total of 224 quantitative studies were identified from these sources as potentially meeting the criteria described earlier. After separately examining each article's title, abstract, and, when necessary, body and references, we agreed on an initial inclusion/exclusion decision for 221 of the 224 studies (99%). After further examination, 174 studies were included and 50 were excluded because they did not meet the criteria for belonging to the interactionist tradition as defined above; the excluded studies include the 3 on which we had initially disagreed. The number of studies from each source is also shown in Table 1. The complete list of studies included in the present synthesis is available as online supporting information Appendix S1 on the Language Learning Web site.

Table 1. Sources, years of issues searched, and number of interactionist studies

Source | Year(s) | K
Applied Language Learning | 1990–2009 | 2
Applied Linguistics | 1980–2009 | 13
Applied Psycholinguistics | 1980–2009 | 0
Bilingualism: Language and Cognition | 1998–2009 | 0
Canadian Modern Language Review | 1980–2009 | 1
Foreign Language Annals | 1980–2009 | 3
Gass & Madden [a] | 1985 | 6
Language Learning | 1980–2009 | 29
Language Learning & Technology | 1997–2009 | 14
Language Teaching Research | 1997–2009 | 14
Mackey [a] | 2007 | 15
The Modern Language Journal | 1980–2009 | 17
Second Language Research | 1985–2009 | 1
Studies in Second Language Acquisition | 1980–2009 | 34
System | 1980–2009 | 13
TESOL Quarterly | 1980–2009 | 12
Total | | 174

[a] Edited volume.

Coding

Based on the sources and previous instruments for assessing methodological quality described earlier, a coding scheme (see Table 2) was developed and employed to extract information about each study in the following five categories: (a) study identification (e.g., year of publication, journal), (b) design (e.g., use of comparison group, delayed posttesting), (c) analyses (e.g., correlation, t test), (d) reporting of data (e.g., instrument reliability, statistical significance), and (e) outcomes (i.e., effect sizes). Of all the features in Table 2, four emerged as most important when exploring methodological quality. They were, in order of preference for empirical control, as follows: random group assignment, inclusion of a control or comparison group, pretesting, and delayed posttesting. The first author coded all 174 articles; two additional, trained raters recoded 10 studies each (20 studies total or 12%). Agreement was at 99.5% for the first additional rater and 96.3% for the second additional rater; overall percentage agreement was 97.9%.
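Percentage agreement of the kind reported above can be computed as in the following minimal sketch. The item codes are hypothetical. The last lines show only that the reported overall figure of 97.9% matches the simple average of the two raters' figures, as would be expected if both recoded the same number of items; the source does not state how the overall figure was obtained.

```python
def percent_agreement(codes_a, codes_b):
    """Percentage of items on which two coders assigned the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must rate the same nonempty set of items")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * matches / len(codes_a)


# Hypothetical yes/no (1/0) codes assigned by two coders to eight items.
first_coder = [1, 0, 1, 1, 0, 0, 1, 0]
second_coder = [1, 0, 1, 0, 0, 0, 1, 0]
agreement = percent_agreement(first_coder, second_coder)  # 7 of 8 items match

# Reported agreement for the two additional raters (from the text above).
rater_agreement = [99.5, 96.3]
overall = sum(rater_agreement) / len(rater_agreement)
```

Simple percentage agreement does not correct for agreement expected by chance, which is worth keeping in mind when most items are coded as binary present/absent.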

Table 2. Coding scheme: Categories and items [a]

Coding category | Items [b]
Identification | Author, Year, Journal, Title
Design | Observational/experimental [c], Pretest [d], Delayed posttest [e], Comparison group, Classroom/lab, Random assignment: by group, Random assignment: by individual, N: comparison group(s), N: treatment group(s)
Analyses | Correlation, Chi-square, t test (or a nonparametric equivalent), ANOVA (or a nonparametric equivalent), ANCOVA, MAN(C)OVA, Factor analysis, Regression
Reporting | Reliability (instrument or interrater), Type I error rate, Power, Statistical assumptions checked, Percentage, p-value: exact, p-value: > or <, Mean, Standard deviation, Mean of gains, Standard deviation of gains, Frequency, t value, F value, Confidence intervals, Effect size, Effect size calculatable
Outcome | Effect size(s)

[a] Additional variables were also coded but are not included here due to their mostly substantive (and not methodological) nature.
[b] With the exception of effect sizes, all items were coded as a binary yes/no or present/absent.
[c] Studies were coded as observational if their design was descriptive, seeking to examine or measure a particular phenomenon; experimental studies were those that sought to measure the effect (usually by means of a posttest) of a particular phenomenon or treatment.
[d] Pretests were used in some studies to ensure equivalence of groups prior to carrying out a particular task but in others to compare to posttests to measure gains or change over time. Both types were coded as +pretest.
[e] Delayed posttests were any posttreatment measures given after an initial posttest.

Analysis

Research questions 1 and 2 were addressed by calculating descriptive statistics for the different research designs, methodological features, statistics, and data reporting practices within the cumulative body of SLA research as defined by this study. Frequencies and percentages were also calculated for the four features associated with methodological quality (random group assignment, inclusion of a control or comparison group, pretesting, delayed posttesting) across four different types of designs: (a) observational studies carried out in classrooms (O + C), (b) observational studies carried out in laboratories (O + L), (c) (quasi-)experimental studies carried out in classrooms (E + C), and (d) (quasi-)experimental studies carried out in laboratories (E + L). To answer research question 3, we examined effect sizes (d values) within subgroups of studies/samples based on different methods and practices associated with methodological quality (i.e., using items in the coding scheme from research questions 1 and 2 as grouping or independent variables). When more than one effect size was extracted for a particular sample, those effects were averaged. Effect sizes at the group level (e.g., did vs. did not pretest) were then compared both overall and across the four types of designs described earlier. Studies that reported insufficient data to calculate an effect size (d value) were excluded, as were d values larger than 3, which were considered outliers.² We took a historical, rather than cumulative, approach to research question 4 by comparing the data obtained to answer research questions 1, 2, and 3 over three 10-year intervals (1980–1989, 1990–1999, and 2000–2009). More specifically, percentages of each design type/feature, statistical analysis, and type of data reported were calculated for each decade. Average d values were calculated for each 10-year span as well.
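The effect size procedure described above (averaging within samples, excluding unusable or outlying values, then comparing subgroup and decade means) can be sketched as follows. The records, field names, and helper functions are hypothetical. Applying the cutoff of 3 to the sample-level average, rather than to each individual d value, is an assumption, as the text does not specify the order of these steps.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: one per sample, with the d values extracted from it.
samples = [
    {"year": 1985, "pretest": False, "d_values": [0.40, 0.60]},
    {"year": 1994, "pretest": True, "d_values": [0.90]},
    {"year": 1998, "pretest": True, "d_values": [3.50]},   # outlier: d > 3
    {"year": 2003, "pretest": False, "d_values": []},      # d not calculable
    {"year": 2006, "pretest": True, "d_values": [0.70, 1.10]},
]

D_CUTOFF = 3.0  # d values larger than 3 are treated as outliers


def sample_effects(records, cutoff=D_CUTOFF):
    """Average the d values within each sample, then drop unusable samples."""
    kept = []
    for rec in records:
        if not rec["d_values"]:
            continue  # insufficient data to calculate an effect size
        d = mean(rec["d_values"])
        if d > cutoff:
            continue  # excluded as an outlier (assumed: cutoff on the average)
        kept.append({**rec, "d": d})
    return kept


def mean_d_by(records, key):
    """Mean d for each subgroup defined by key(record)."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec["d"])
    return {group: mean(ds) for group, ds in groups.items()}


def decade(rec):
    """Label a record with its 10-year interval, e.g., '1990-1999'."""
    start = rec["year"] // 10 * 10
    return f"{start}-{start + 9}"


usable = sample_effects(samples)
by_pretest = mean_d_by(usable, lambda rec: rec["pretest"])  # did vs. did not pretest
by_decade = mean_d_by(usable, decade)                       # change over time
```

The same `mean_d_by` helper serves both the subgroup comparison for research question 3 and the decade comparison for research question 4; only the grouping key changes.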

Results

The first research question was motivated by an interest in several facets of study design and methodological quality. Specifically, it asked about the extent to which methodological features and statistical/analytical procedures had been present in quantitative studies of interaction. Table 3 shows the frequency and percentage of studies/samples for which different methods were used. Overall, we see that interaction is an area of research that has employed observational and laboratory designs to test its hypotheses substantially more often than experimental and classroom-based studies. Looking across the four design categories, we see a generally larger portion of the features of study quality among experimental rather than observational research (which is not entirely surprising given the greater applicability of those features to research seeking to test the outcome of a particular treatment). Random group assignment was used in 78% of E + L studies and to a much lesser extent among O + C (0%), O + L (31%), and E + C (23%) studies. Likewise, we found comparison groups present in nearly all E + L studies, approximately half the E + C and O + L studies, and about a third of the O + C studies. Pretests were used most often in E + C studies (87%), slightly less in E + L studies (78%), and substantially less in both types of observational studies. Among those studies seeking to measure the effect of a particular treatment or intervention (i.e., E + C and E + L), approximately 80% have examined the durability of treatments using one or more delayed posttests.

Table 3. Design features in studies of L2 interaction

                                O + C (54 studies)  O + L (72 studies)  E + C (30 studies)  E + L (37 studies)
Variable           Value        K     %             K     %             K     %             K     %
Random assignment  Individual   0     0             22    31            7     23            29    78
                   Group        2     4             1     1             6     20            2     5
Comparison group   Yes          17    32            35    49            24    60            35    95
Pretest            Yes          12    22            12    40            26    87            29    78
Delayed posttest   Yes          NA    NA            NA    NA            23    77            30    81

Note. O + C = observational + classroom; O + L = observational + lab; E + C = experimental + classroom; E + L = experimental + lab. A small number of studies were both observational and experimental (e.g., Iwashita, 2003), or were carried out in both laboratories and classrooms (Gass, Mackey, & Ross-Feldman, 2005). Thus, the total number of studies across cells is greater than 174.

Table 4 shows the size of samples, a factor related to power and precision of findings and, thus, research quality. Not only are there more treatment groups and participants, but there is also a noticeable difference between the average size and range of sample sizes for treatment groups (23.47, 1–260) and comparison groups (19.26, 4–117).

Table 4. Sample sizes in studies of interaction

Groupᵃ       Total Nᵇ   Total Kᶜ   N/Kᵈ    Minimum–Maximum
Treatment    5,986      259        23.47   1–260
Comparison   1,965      106        19.26   4–117
Total        7,951      365        22.27

ᵃ Fifteen studies (21 samples) compared alternate conditions, neither of which was identified as a comparison or treatment group, and these samples were coded for experimental condition somewhat arbitrarily.
ᵇ Total number of participants. When a study included only one sample, participants were counted as belonging to a treatment group.
ᶜ Total number of samples. It is common to include multiple treatment conditions, which is why there are more samples than studies.
ᵈ N/K = total number of participants divided by the total number of samples. Additionally, four studies did not report a sample size, so average sample sizes for treatment and comparison group were calculated based on 255 and 102 groups, respectively.

Research question 1 also asked about the use of different statistical procedures and techniques. In Table 5 we see that well over half of the studies of interaction were interested in testing for differences between group means. Categorical or frequency data were also used to calculate chi-squares in nearly one third of the studies, and 15% of the studies in this sample used correlations and/or regressions. It is also interesting to note that almost two thirds of the studies presented data for multiple statistical tests.

Table 5. Statistical analyses in studies of interaction

Type of analysis   K     % of total
t Test             69    40
ANOVA              54    31
Chi-square         50    29
Correlation        18    10
Regression         8     5
ANCOVA             7     4
MAN(C)OVA          7     4
Factor analysis    2     1
None               10    6
One                56    32
Multiple           108   62

Research question 2 approaches study quality by examining data reporting practices with respect to several sources, including the guidelines provided by the APA and several SLA journals, measures of methodological quality from the research synthesis literature (e.g., Valentine & Cooper, 2008), recommendations for improving SLA reporting practices (e.g., Chaudron, 2001; Norris & Ortega, 2006), and meta-analyses that have identified methodological weaknesses in L2 research (e.g., Russell & Spada, 2006). Table 6 displays the frequencies and percentages with which different types of data were reported. Reliability estimates are one area that, although not perfect, has been reported very well (compared to other areas of SLA research; see Norris & Ortega, 2000; Plonsky, in press). Most other statistics have been reported either insufficiently or unevenly across the sample of studies. Although almost all of the studies reported using statistical tests (see Table 5), only 25% reported setting a predetermined level of statistical significance, 2% reported the results of a power analysis, and only 3% (five studies, three by the same author, McDonough) reported checking the assumptions of their statistical tests. A somewhat smaller portion of studies reported statistical significance as an exact p-value (44%) than as greater or less than a particular p-value such as .05 (61%). Both figures appear low, again, in light of the very high percentage of studies in the sample that employed statistical tests. Furthermore, reporting in this area is inconsistent not only in the aggregate but also within individual studies: 46 studies (26%) reported both exact and relative (i.e., < or >) p-values. Means and standard deviations were presented in 64% and 52% of the studies, respectively. These figures are also somewhat low considering the frequency of studies employing mean-based statistical tests.
Moreover, those data also indicate that 12% of the studies reporting means did so without reporting the standard deviations of those means. Along these same lines, we also see that the percentage of studies reporting t values and f values was only 26% and 32%, respectively (compared to 40% of studies reporting t tests and 39% reporting ANOVAs, ANCOVAs, and/or MAN[C]OVAs). Other statistics coded for were confidence intervals, reported in only five studies (3%), effect sizes (including d values and η² for mean differences, phi coefficients for χ², r² and Cramer's V for correlations; 18%), and whether an effect size (Cohen's d) could be calculated from data in the report (41%). Finally and perhaps most surprisingly, 5% of the studies in the sample did not report sample size.

Table 6. Data reported in studies of interaction

                                        Yes          No
Variable                                K     %      K     %
Reliabilityᵃ                            111   64     63    36
Predetermined Type I error rate         43    25     131   75
Power                                   3     2      171   98
Statistical assumptions checked         5     3      169   97
p-Value, exact                          76    44     98    56
p-Value, < or >                         106   61     68    39
Mean                                    112   64     62    36
Standard deviation                      90    52     84    48
Mean gains                              10    6      164   94
Standard deviation of gains             9     5      165   95
Frequency                               122   71     52    29
Percentage                              108   62     66    38
t Value                                 45    26     129   74
f Value                                 55    32     119   68
Confidence intervals                    5     3      169   97
Effect sizeᵇ                            31    18     143   82
Effect size (Cohen's d) calculatable    71    41     103   59

ᵃ Reliability includes rater and instrument reliability.
ᵇ Studies that reported percentages or correlation coefficients were not coded as having reported an effect size.

Research question 3 investigates the outcomes (i.e., effect sizes) produced in interactionist research in relation to different designs as well as research and reporting practices. With respect to design types not associated necessarily with quality (see Table 7), somewhat larger effects were found for experimental over observational studies (d = 0.72 vs. 0.51). Although this difference is not statistically significant due to overlapping confidence intervals, it is worth noting the relatively narrow confidence intervals around the mean for experimental studies, which indicate a relatively precise estimate of that population of studies. There was virtually no difference between studies carried out in classrooms (d = 0.64) and laboratories (d = 0.65). We also see that effects from classroom studies are the most homogeneous of the four design types, as seen in a relatively small standard deviation.

Table 7. Subgroup analysis of effect sizes based on design type and setting

Variable  Value          k     M (d)   SD (d)  95% Confidence interval
Design    Observational  39    0.51    0.83    0.25–0.76
          Experimental   81    0.72    0.71    0.57–0.88
Setting   Classroom      42    0.64    0.63    0.45–0.83
          Lab            64    0.65    0.81    0.45–0.85

Note. All effect sizes aggregated here were extracted from comparison versus experimental contrasts on immediate posttest results.
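
The confidence intervals in Table 7 and the overlap comparisons drawn from them can be illustrated with a short Python sketch. It assumes a normal-approximation interval (mean ± z × SD / √k); the source does not state how its intervals were computed, although the published values are consistent with this formula up to rounding of the inputs. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mean_ci(m, sd, k, level=0.95):
    """Normal-approximation confidence interval for a mean of k effect sizes."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    half_width = z * sd / sqrt(k)
    return m - half_width, m + half_width


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Lab row of Table 7: k = 64, M = 0.65, SD = 0.81.
low, high = mean_ci(0.65, 0.81, 64)
print(f"{low:.2f} to {high:.2f}")

# Published intervals for observational vs. experimental designs.
print(intervals_overlap((0.25, 0.76), (0.57, 0.88)))
```

Note that overlapping intervals are only a rough guide: two means can differ reliably even when their 95% intervals overlap slightly, so a direct test of the difference is more precise when the raw data are available.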

Table 8 presents effect sizes from studies with four design features associated with quality across the four design types. The process by which a study assigns participants to conditions appears to relate to its outcome, with substantially larger effects for studies employing random group assignment at the individual level (d = 0.70) compared to random assignment at the group or class level (d = 0.39). At the same time, however, considering their almost entirely overlapping confidence intervals, there is essentially no difference between studies that randomly assign at the individual level and studies that do not randomly assign participants at all. Among studies that included a comparison group, experimental studies appear to produce larger effects than observational ones, mirroring the overall pattern in effects between observational and experimental designs. A slightly smaller effect (with mostly overlapping confidence intervals) was found for studies with pretests compared to those without. Again, similar to studies with comparison groups and overall, experimental studies with and without pretests produced larger effects than observational studies of both types. Finally, perhaps the most striking relationship between study features and outcomes is found among experimental studies that do and do not include delayed posttests in their designs. The overall effect obtained on the immediate posttest from studies with delayed posttests is much larger (0.90) than those without (0.20), and the accompanying confidence intervals do not overlap. This pattern also holds when looking at experimental classroom (E + C) and lab designs (E + L) with and without delayed posttests separately, although the confidence intervals between E + C studies with and without delayed posttests overlap slightly.

Table 8. Subgroup analysis of effect sizes based on design features associated with study quality

Variable           Value               k     M (d)   SD (d)  95% Confidence interval
Random assignment  Individual (all)    47    0.70    0.76    0.48–0.92
                     O + C             0
                     O + L             16    0.41    0.61    0.11–0.71
                     E + C             3     1.40    0.76    0.55–2.26
                     E + L             34    0.72    0.82    0.44–0.99
                   Group (all)         14    0.39    0.59    0.08–0.70
                     O + C             1
                     O + L             0
                     E + C             12    0.47    0.60    0.13–0.81
                     E + L             2     −0.10   0.09    −0.23–0.03
                   None (all)          45    0.67    0.75    0.45–0.89
                     O + C             11    0.43    0.68    0.03–0.83
                     O + L             10    0.58    1.25    −0.19–1.36
                     E + C             21    0.77    0.59    0.51–1.02
                     E + L             8     0.90    0.42    0.61–1.20
Comparison group   Yes (all)           105   0.64    0.74    0.50–0.78
                     O + C             12    0.52    0.72    0.11–0.92
                     O + L             26    0.48    0.89    0.13–0.82
                     E + C             36    0.72    0.64    0.51–0.93
                     E + L             44    0.71    0.77    0.49–0.94
                   No (all)            0
                     O + C             0
                     O + L             0
                     E + C             0
                     E + L             0
Pretest            Yes (all)           80    0.62    0.71    0.47–0.78
                     O + C             6     0.39    0.85    −0.29–1.07
                     O + L             12    0.25    0.75    −0.17–0.68
                     E + C             32    0.68    0.66    0.45–0.91
                     E + L             39    0.68    0.76    0.45–0.92
                   No (all)            26    0.72    0.85    0.40–1.05
                     O + C             6     0.64    0.61    0.16–1.13
                     O + L             14    0.67    0.99    0.15–1.18
                     E + C             4     1.07    0.34    0.73–1.41
                     E + L             5     0.94    0.88    0.17–1.71
Delayed posttest   Yes (Exp. only)     63    0.90    0.64    0.74–1.06
                     O + C             NA
                     O + L             NA
                     E + C             27    0.85    0.62    0.62–1.08
                     E + L             35    0.93    0.68    0.70–1.15
                   No (Exp. only)      16    0.20    0.46    −0.03–0.42
                     O + C             NA
                     O + L             NA
                     E + C             9     0.33    0.58    −0.05–0.71
                     E + L             7     0.03    0.14    −0.07–0.13

Note. All effect sizes aggregated here were extracted from comparison versus experimental contrasts on immediate posttest results.

Four reporting practices associated with quality were also investigated in relation to study outcomes. The results are reported in Table 9. The first two comparisons reveal larger effects for studies with nonpreferred reporting practices. First, there is a large and consistent difference, indicated by nonoverlapping confidence intervals, between studies that report (d = 0.42) and do not report reliability (d = 0.96). Studies that do not report a predetermined p-value (0.71) also appear to produce larger effects than the minority of studies that do (0.53), although this difference may not be reliable due to overlapping confidence intervals. Also related to statistical significance, effects were compared for studies that reported exact p-values and those that reported p-values as greater or less than a particular Type I error rate (e.g., .05, .10). There was virtually no difference between these two types of studies. Finally, whether a study reported an effect size did not appear to relate to the magnitude of its effects, with average d values of 0.62 and 0.66 for studies that did and did not report an effect size, respectively.

Table 9. Subgroup analysis of effect sizes based on reporting practices associated with study quality

Variable           Value               k     M (d)   SD (d)  95% Confidence intervals
Reliability        Reported (all)      61    0.42    0.67    0.25–0.59
                     O + C             11    0.58    0.72    0.15–1.00
                     O + L             21    0.26    0.72    −0.05–0.56
                     E + C             17    0.53    0.74    0.18–0.88
                     E + L             24    0.50    0.68    0.23–0.78
                   Not reported (all)  45    0.96    0.73    0.75–1.17
                     O + C             1
                     O + L             5     1.39    1.04    0.48–2.31
                     E + C             19    0.89    0.50    0.67–1.12
                     E + L             20    0.96    0.80    0.61–1.31
Preset p-value     Yes (all)           34    0.53    0.87    0.23–0.82
                     O + C             8     0.66    0.83    0.08–1.24
                     O + L             7     0.40    0.88    −0.26–1.05
                     E + C             15    0.51    0.75    0.13–0.89
                     E + L             15    0.63    1.08    0.08–1.17
                   No                  70    0.71    0.68    0.55–0.87
                     O + C             4     0.24    0.31    −0.07–0.54
                     O + L             19    0.50    0.92    0.09–0.92
                     E + C             19    0.93    0.52    0.70–1.16
                     E + L             29    0.76    0.56    0.55–0.96
p-value            Exact               74    0.67    0.74    0.51–0.84
                     O + C             5     0.75    0.91    −0.05–1.55
                     O + L             18    0.49    0.85    0.10–0.88
                     E + C             23    0.53    0.61    0.28–0.78
                     E + L             39    0.79    0.77    0.55–1.03
                   < or >              83    0.62    0.70    0.47–0.77
                     O + C             12    0.52    0.72    0.11–0.92
                     O + L             21    0.32    0.85    −0.04–0.69
                     E + C             28    0.89    0.59    0.67–1.11
                     E + L             32    0.65    0.69    0.42–0.89
Effect size        Reported            30    0.62    0.90    0.47–0.77
                     O + C             3     0.62    1.24    −0.79–2.02
                     O + L             11    0.37    0.72    −0.06–0.79
                     E + C             13    0.48    0.78    0.06–0.91
                     E + L             13    0.86    1.08    0.28–1.45
                   Not reported        76    0.66    0.68    0.51–0.81
                     O + C             9     0.48    0.56    0.12–0.85
                     O + L             15    0.55    1.02    0.04–1.07
                     E + C             23    0.86    0.52    0.65–1.07
                     E + L             31    0.65    0.60    0.44–0.86

Note. All effect sizes aggregated here were extracted from comparison versus experimental contrasts on immediate posttest results.

Research question 4 asked how different methods, analyses, reporting practices, and study outcomes have changed over 30 years of research on L2 interaction. This question was addressed by calculating the percentage of different types of studies from three 10-year intervals: 1980–1989, 1990–1999, and 2000–2009.

Several changes can be observed in interactionist designs over time (see Figure 1). For example, observational research decreased from decade one to two and, conversely, experimental research increased. Most likely connected to the increase in experimental studies over time, we also see a greater percentage of studies employing both pretests and delayed posttests in more recent years. The use of random group assignment, another indication of study quality, also increased over time, despite a simultaneous increase in classroom studies (over lab studies), where it is more difficult to randomize participants to experimental conditions. Finally, the move toward higher quality methods is also evident in the increase of comparison groups over time.

Figure 1. Percentages of studies/samples with different types of designs across three decades.

As discussed in the literature review, many early studies of L2 interaction were interested in measuring the relative frequency of different conversational features (e.g., clarification requests), with later studies often looking to compare scores between groups on posttests. It is not surprising, then, to note that frequency-based analyses (e.g., chi-squares) decreased somewhat as the two most commonly used mean-based analyses—t tests and ANOVAs—increased steadily over time (see Figure 2). Perhaps somewhat surprisingly, there is little evidence for an increase in statistical sophistication in terms of the types of analyses performed in studies of interaction. Nevertheless, the number of statistical tests carried out in this body of research appears to increase over time. Figure 3 shows the percentage of studies from each decade that employed zero, one, and multiple statistical tests. We see a steady decrease over time in the portion of studies with only one test and, meanwhile, an increase in studies with multiple tests.

Figure 2. Percentages of studies/samples with different statistical analyses across three decades.

Figure 3. Percentages of studies using zero, one, and multiple statistical analyses across three decades.

Changes in reporting practices were also examined (see Figure 4), revealing several improvements in quality over time. Steady or sharp increases were found for the percentage of studies reporting instrument reliability, predetermined levels of statistical significance, exact p-values, and sufficient data to calculate a d value, which is tied to the increase in the reporting of standard deviations to accompany means. Four additional reporting practices stand out as incipient and positive advances, all of which were completely absent in interactionist research prior to the 2000s: (a) checking the assumptions of statistical tests, (b) power analysis, (c) effect sizes, and (d) confidence intervals. Finally, t and f statistics are both reported in an increasingly greater portion of studies over time. However, neither matches the percentage of studies that report analyses producing those statistics in any of the three decades.

Figure 4. Percentages of studies/samples reporting different types of data across three decades.

The general trend in the magnitude of effects from interactionist research is to decrease over time, with average d values of 1.62, 0.82, and 0.52 from the 1980s, 1990s, and 2000s, respectively (see Table 10). Along with a decrease in effect sizes, we also observe an increase in the number of studies published over time and in the precision of their aggregated effects, as indicated by increasingly narrow confidence intervals. Finally, although it was not the core objective of this study to meta-analyze the quantitative findings of the interactionist tradition, the average effect size (d) from this body of research is 0.65, which can be considered medium in relation to other effects in SLA (Oswald & Plonsky, 2010; Plonsky & Oswald, 2010).

Table 10. Effect sizes across three decades

Decade      Studies  k     M (d)   SD (d)  95% Confidence interval
1980–1989   4        5     1.62    0.72    0.99–2.25
1990–1999   13       27    0.82    0.59    0.60–1.04
2000–2009   37       75    0.52    0.73    0.36–0.69
All         55       107   0.65    0.74    0.51–0.79

Note. All effect sizes aggregated here were extracted from comparison versus experimental contrasts on immediate posttest results.
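
As a quick arithmetic check on Table 10, the overall mean can be recovered from the decade-level rows. The sketch below assumes the overall figure is a simple mean across samples, which is equivalent to weighting each decade's mean by its number of samples (k); the source does not state its weighting scheme, but the published values are consistent with this reading.

```python
# (k, mean d) for each decade, taken from Table 10.
decades = {
    "1980-1989": (5, 1.62),
    "1990-1999": (27, 0.82),
    "2000-2009": (75, 0.52),
}

total_k = sum(k for k, _ in decades.values())
overall_d = sum(k * d for k, d in decades.values()) / total_k

print(total_k, round(overall_d, 2))  # prints: 107 0.65
```

Because the 2000s contribute 75 of the 107 samples, the overall mean sits much closer to that decade's 0.52 than to the 1.62 observed in the 1980s.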

Discussion

The primary goal of this study was to examine methodological quality in the interactionist tradition of SLA by quantifying methodological and reporting practices from both cumulative and developmental (i.e., historical) perspectives. In light of the multidimensional nature of “study quality” and in the absence of previous research of this type, this study did not intend to provide a categorical result of research on interaction as “high” or “low” in quality. Rather, this study and the first two research questions in particular asked about the extent to which a number of research and reporting practices relevant to quantitative research were present in studies of L2 interaction.

Overall Study Quality

The results of research questions 1 and 2 indicate several strengths in interactionist methods and reporting practices. The use of delayed posttests in approximately 80% of experimental studies, for example, is remarkable. By comparison, only 11 out of 59 studies (19%) in Plonsky's (in press) meta-analysis of L2 strategy instruction tested the endurance of treatment effects. Another strength across this body of studies is the concern over instrument reliability, reported in 64% of the sample (compared to only 6% of the sample meta-analyzed in Nekrasova & Becker, 2009, and 16% in Norris & Ortega, 2000). Although outside the scope of this article, a separate but arguably more critical issue is the actual reliability (and validity) of instruments in interactionist research regardless of whether a coefficient was reported (see Norris & Ortega, 2003).

Despite methodological rigor in these areas, weaknesses in the aggregate findings appear to outnumber the strengths. Pretesting is one example. Although the percentage of studies employing pretests (39%) is nearly equal to the percentage of (quasi-)experimental studies (38%), 14 of the latter (21%) did not pretest. Also related to research quality (and more specifically, generalizability) in experimental research, random assignment to experimental conditions has often been overlooked. Use of intact classes—whether for convenience or to preserve ecological validity—impedes random group assignment; any assessment of this type would be remiss not to take this into consideration. Nevertheless, with 58% of the studies in the sample being carried out in laboratory settings, only 32% employed random group assignment (37% including studies that assigned experimental conditions at the group level, compared to 53% in studies of L2 strategy instruction; Plonsky, in press). Another serious concern is statistical power, or the likelihood that a statistical test will not make a Type II error (i.e., that it will reject the null hypothesis when the alternative hypothesis is true). Power can also be interpreted as “the probability that the test will lead to a correct conclusion about the null hypothesis” (Murphy & Myors, 2004, p. vii) when a treatment such as interaction is believed to have an effect such as L2 development. To illustrate the problematic nature of this issue, let us consider a cumulative, post-hoc power analysis of the research described in this study: Assuming a mean n of 22 (see Table 4) and d value of 0.65 (see above) with alpha = .05, the overall power in interactionist research is approximately 0.56. We can interpret this figure, far below Cohen's (1988) recommended minimum power of 0.80, as indicating that typical quantitative studies of interaction may have only a 56% chance of detecting a statistically significant effect when one exists.
This datum also provides large-scale, empirical support for sporadic warnings from SLA scholars about the debilitating effects of low power on the field's ability to accurately test hypotheses and advance our understanding (e.g., Crookes, 1991; Flahive & Ehlers-Zavala, 2010; Hauser, 2001; Larson-Hall, 2010; Lazaraton, 1991). But instead of speculating on whether or how those effects may have played out in this line of research, we would prefer to look forward. The solutions are the following: (a) precedence of practical over statistical significance (see, e.g., Norris & Ortega, 2006; Plonsky & Oswald, in press) and (b) larger sample sizes, a simple yet challenging proposition that might be realized by addressing fewer relationships between variables at the between-groups level. Of course, fewer multilevel designs such as this would yield fewer results per study, but fewer results are certainly preferable to more numerous results that lack reliability or power.
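
The post-hoc power figure discussed above, and the call for larger samples, can be approximated with a short Python sketch. It rests on several assumptions that are not stated in the source: a two-sided comparison of two independent groups, 22 participants in each group, and a normal approximation to the t distribution. Under that approximation the estimate comes out at roughly 0.58, slightly above the 0.56 reported, which presumably reflects an exact t-based calculation. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_NORMAL = NormalDist()


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means."""
    z_crit = _NORMAL.inv_cdf(1 - alpha / 2)
    noncentrality = d * sqrt(n_per_group / 2)
    # The negligible chance of rejecting in the wrong direction is ignored.
    return _NORMAL.cdf(noncentrality - z_crit)


def n_for_power(d, power=0.80, alpha=0.05):
    """Approximate per-group sample size needed to reach the target power."""
    z_crit = _NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / d) ** 2)


print(round(approx_power(0.65, 22), 2))  # power at the observed averages
print(n_for_power(0.65))                 # per-group n for 0.80 power
```

Under the same assumptions, reaching Cohen's (1988) recommended power of 0.80 for d = 0.65 would take about 38 participants per group, well above the average of roughly 22 per sample reported in Table 4.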

This study also identified several weaknesses in reporting practices. Before discussing results from this part of the study, some of the underlying logic and assumptions should be briefly revisited. The rationale behind this phase of the study was that (a) there is value in adhering to the guidelines put forth by the APA and others with influence in the field (e.g., journal editors) and therefore (b) thorough data reporting according to those guidelines is a valid measure of quality. However, it is important to keep in mind that completeness cannot be equated with overall study quality nor incompleteness with low quality (Wells & Littell, 2009); this article, therefore, takes the position that reporting practices are one of several aspects to consider in relation to overall study quality.

Related to the previous discussion on small sample size, Crookes’s (1991) claim that power analyses “are almost never used” (p. 762) still applies two decades later. Again, the almost complete absence of power analyses does not necessarily indicate low methodological quality or even low power, but possible causes of this condition (e.g., small samples, unfamiliarity with the implications of statistical power) may be symptomatic of other issues (see discussion below on the relationship between study quality and outcomes). The same could also be said for the scarcity of studies that report checking whether the assumptions of statistical tests have been met.

Of more immediate consequence are inconsistencies in reporting practices such as exact and relative p-values, means without standard deviations, and t tests or ANOVAs without t or f values (and/or without group means and standard deviations). There are two main concerns with incomplete reporting of these types of data. First, as mentioned earlier and well-documented elsewhere (e.g., Wilson, 2009; see also Chan, Hróbjartsson, Haahr, Gøtzsche, & Altman, 2004), cumulative results obtained through meta-analysis are severely disadvantaged when the primary literature fails to report sufficient data to extract an effect size. Second, ignoring momentarily the importance of summarizing findings via meta-analysis, reporting of standard deviations to accompany means and other statistics is vital to consumers’ ability to understand and interpret the findings of individual studies. Also associated with meta-analysis and the inherently broad, quantitative approach it takes to the accumulation of data, effect sizes and confidence intervals have been largely absent from the reports examined in this study.

These problems may be traced back to two distinct but nonexclusive conditions in the wider field of SLA. First, the findings described earlier point to both opaqueness and idiosyncrasy regarding what constitutes minimally sufficient data reporting, possibly resulting from inadequate training, as suggested by Lazaraton, Riggenbach, and Ediger (1987), who surveyed applied linguists on their familiarity with 23 different methodological and statistical concepts (see also Lazaraton, 2000; Norris & Ortega, 2003, 2006; Schmidt, 1996; Shohamy, 2000; Tversky & Kahneman, 1971). Based on the participants’ responses, “power” was found to be the second-to-least familiar, suggesting insufficient coverage of this topic in research methods courses. That said, many of the details involved in professional activities such as publishing fall outside the scope of our formal instruction. Although aspects of research design and reporting may not be emphasized sufficiently in graduate programs, a more likely explanation is a lack of standards imposed by editors and others who determine acceptability (and publishability) of SLA research. An alternate explanation for many of the systematic weaknesses identified by this study is researchers’ lack of concern with more fine-grained interpretations of their data and, conversely, synthetic research such as meta-analysis, both of which require more complete reporting of data. However, editors in SLA and other standard-bearers, so to speak, cannot be faulted entirely. In many cases, by upholding the value of statistical significance (i.e., p-values) over precision of findings or practical significance (expressed through confidence intervals and effect sizes), they are simply preserving a cultural norm in social science research. The unfortunate consequence of their inaction toward a more synthetically minded approach to primary research, however, is slower progress and greater inefficiency.

Study Quality and Outcomes

As described in the literature review, “study results are determined conjointly by the nature of the substantive phenomenon under investigation and the nature of the methods used to study it” (Lipsey, 2009, p. 150). Bearing this association in mind along with previous syntheses that have investigated outcomes in relation to research and reporting practices (e.g., Lipsey & Wilson, 1993; Plonsky, in press; Russell & Spada, 2006), this study also explored the effects in interactionist studies according to different designs as well as several indicators of methodological quality.

In short, the data analyzed to answer research question 3 indicate a mixed relationship between research and reporting practices and study outcomes. This study first examined the relative effects found in observational versus experimental and laboratory versus classroom studies, a matter of theoretical, methodological, and practical importance. Somewhat larger overall effects were found for experimental over observational studies. Upon examining the confidence intervals, however, we see that this difference may not be trustworthy and that the most precise mean effect of these four designs is found among experimental studies. There was no difference between studies carried out in classrooms and labs (see Gass et al., 2005). This result, perhaps unexpected considering the enhanced experimental control in lab research, contradicts the findings of at least three meta-analyses of L2 research that found consistently larger effects for lab- than classroom-based studies (Li, 2010; Mackey & Goo, 2007; Plonsky, in press). Although two of these meta-analyses also examined research in the interaction tradition, the discrepancy may be due to the substantially wider scope of this study. The relationship between research setting and study outcomes merits attention in future meta-analyses and methodological syntheses.

Regarding the relationship between methodological quality and study outcomes, we found very few clear patterns. However, for two indicators of study quality, studies of higher quality produced larger effects: When experimental and quasi-experimental studies included delayed posttests, they yielded a substantially larger mean effect on immediate posttests than when they did not, and this finding holds up regardless of whether they were conducted in the classroom (E + C) or in the laboratory (E + L). Likewise, there was a substantial advantage for studies that randomly assigned individual participants to conditions (compared to random assignment by group or class), although this finding did not hold when compared to studies that did not randomly assign participants at the individual or group level. We also found one case of studies with a nonpreferred feature—not reporting reliability—producing consistently larger effects (i.e., effects with nonoverlapping confidence intervals).

To be clear, there is no evidence to claim that quality affects outcomes. No one would suggest, for example, that randomly assigning participants to experimental conditions or reporting reliability causes larger or smaller effects (see Lipsey, 2009). Why, then, might studies employing certain research and reporting practices produce larger or smaller effects? Vacha-Haase and Thompson (2004) partially addressed this question, reminding us that “effect sizes are not magically independent of the designs that created them” (p. 478). However, their words do little to explain the relationship between effect sizes and other study/report features that could not possibly influence their outcome. One explanation for larger effects found among studies exhibiting features of higher quality (e.g., delayed posttesting in this study) is a correlation between methodological rigor (i.e., knowledge of how to design and report on a study) and an understanding of the variables being investigated (i.e., knowledge of what to study). It is not unreasonable to presume that researchers who are most familiar with the theoretical domain being investigated and who, by extension, are able to maximally exploit differences between variables and thus produce larger effects might also be most familiar with appropriate research design and reporting practices. From this perspective, we might modify Vacha-Haase and Thompson's (2004) statement with the following bracketed addendum: “Effect sizes are not magically independent of the designs [or the designers] that created them.” However, this argument would not hold up when smaller effects are associated with higher quality research practices, as we found with the reporting of reliability across studies of interaction. In this case, an additional variable—time or the decrease in effect sizes observed over time—appears to confound the relationship between study effects and this particular reporting practice, which increased over time.
This point could also apply more generally. The increase in the number of studies published in the interactionist tradition over time, along with a simultaneous increase in study quality and a decrease in effect sizes, may have created a downward bias in subgrouped effects from studies of higher methodological quality. Of course, a third scenario—no significant relationship between methodological quality and study outcomes—is also possible and was indeed found for several variables in this study (see Wilson & Lipsey, 2001). Whatever the case, whether a relationship between methodological quality and study outcomes exists—in the field of SLA as a whole or across its many subdomains—is an empirical question yet to be answered.

Changes Over Time

Despite less-than-optimal quality found in the aggregate in the studies analyzed in this article, the results of research question 4 indicate improvements over time in several areas. Looking first at design-related features, noteworthy increases were observed in pretests, delayed posttests, the use of random group assignment, and comparison groups. As mentioned in the Results section of this article, greater use of delayed posttests and comparison groups may be conflated with the simultaneous increase in (quasi-)experimental studies. However, as inclusion of these design-related features continues to increase, greater trust can be placed in the findings of interactionist studies, particularly those claiming causal relationships. With respect to the use of different statistical procedures, it appears that research in this area has employed relatively few unique tests but has generally been in line with the APA's recommendation in favor of “minimally sufficient” (Wilkinson & Task Force on Statistical Inference, 1999, p. 598) analytical strategies. In fact, statistical analyses were limited almost exclusively to t tests/ANOVAs (which increased over time), chi-squares (which decreased over time), and correlations (which initially increased then decreased). In addition to the qualitative shift from frequency- to means-based analyses, there has also been an increase over time in the quantity of unique analyses that have been conducted per study. This may be an indication of greater statistical savvy or, more likely, that the depth of analyses in studies of L2 interaction has increased, requiring multiple statistical tests in order to answer all the research questions posed. An increase in statistical tests also means an increase in reliance on null hypothesis significance testing, which is problematic due to the historically low power in interactionist research as well as other reasons described earlier.

Similar to the trend observed in design-related features, several reporting practices also improved over time. Additionally, several more were reported for the first time in the 2000–2009 decade. Again, it is unclear whether these changes are unique to the interactionist tradition and whether they were caused by a change in researcher training, priorities of editors/reviewers, a combination of these factors, or something else. However, there are two likely explanations for the recent emergence of effect sizes and confidence intervals. In an editorial statement (N. C. Ellis, 2000), a “top-down” approach to improving research practices, Language Learning joined at least 23 other academic journals (see Vacha-Haase & Thompson, 2004) in the social sciences requiring that submitting authors “always present effect sizes and their confidence intervals for primary outcomes” (p. xii). (Since then, The Modern Language Journal and TESOL Quarterly have released similar policies with respect to effect sizes.) The impetus for change can and has also come from the bottom up. In the same issue as Ellis's editorial, Norris and Ortega (2000) argued for changes in how L2 research is carried out and disseminated, including greater use of effect sizes and confidence intervals and a diminished role of p-values. In this case, the authors’ concern went beyond how these statistics might improve individual studies. They also sought to improve reporting practices so that would-be meta-analysts would not have to exclude valuable data for a lack of descriptive statistics needed to calculate an effect size. Although flaws in reporting still remain, the present study has found an increase over time in the percentage of reports that include means, standard deviations, t values, and exact p-values, all of which can be used to calculate an effect size (e.g., d value). We can therefore conclude that the meta-analyzability of interactionist research has also increased over time.
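
As a minimal sketch of the conversions mentioned above, assuming two independent groups and the pooled-standard-deviation form of Cohen's d, the following Python functions recover a d value either from reported means and standard deviations or from an independent-samples t value. The function names and the example figures are hypothetical and are not drawn from the studies reviewed here.

```python
import math


def cohens_d_from_means(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference between two independent groups.

    Uses the pooled standard deviation as the standardizer
    (an assumption; other standardizers are also in use).
    """
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)


def cohens_d_from_t(t, n1, n2):
    """Convert an independent-samples t value to d.

    Valid only for a between-groups t test; a paired or
    pre-post t value needs a different conversion.
    """
    return t * math.sqrt(1 / n1 + 1 / n2)


# Hypothetical report: treatment M = 12, SD = 4, n = 20;
# comparison M = 10, SD = 4, n = 20.
d = cohens_d_from_means(12, 4, 20, 10, 4, 20)
print(d)  # 0.5
```

When a report gives only a t value and the group sizes, `cohens_d_from_t` yields the same d as the descriptive statistics would, which is why reporting either set of statistics is enough for later synthesis.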

In addition to designs and reporting practices, this study examined the historical development of outcomes in interaction research. The result was an unambiguous decrease in effect sizes over time. There are several possible and perhaps overlapping explanations for this finding. Recalling the two scenarios described earlier, smaller effect sizes in this case might be related to the maturity of the interactionist tradition as a theoretical and empirical domain. Earlier studies might have been interested in determining whether corrective feedback, for example, has an effect. The differences between groups (and effect sizes) in such studies would be considered relatively coarse (and large) in contrast with more recent studies examining differences among recasts, prompts, and metalinguistic feedback across multiple linguistic structures and/or proficiency levels, for example. While collecting data for this study, we observed another possible indication of the increase in theoretical nuance: The average length of the literature reviews increased over time. Although unorthodox and anecdotal, it stands to reason that more background information would be included in the literature review of a more thoroughly developed domain. There may also be a connection between the decrease in effect sizes and the increase in statistical testing (see above). The rationale here is that if more tests are conducted over time, the contrasts or relationships in question are likely of an increasingly subtle nature thus yielding smaller effects.

Conclusions and Directions for Future Research

Several limitations of empirical efforts in SLA have been documented in recent years (see Norris & Ortega, 2000; Oswald & Plonsky, 2010; Plonsky, in press). This article has responded to those concerns, as well as to calls for greater reflection more broadly speaking on L2 research practice (e.g., Ortega, 2005), by investigating the methodological quality of 174 quantitative studies in the interactionist tradition carried out from 1981 to 2009. The methods and reporting practices employed in interactionist research were quantified and discussed both cumulatively and over time. Several strengths and weaknesses were identified. To apply these findings, we close with a few suggestions for improving future research on L2 interaction. We will also make recommendations for future inquiries into the relatively uncultivated domain of study quality in SLA.

Looking forward, research will be best served by maintaining and in some cases augmenting certain methodological practices such as calculating and reporting instrument reliability, randomly assigning participants to experimental conditions whenever possible, and continuing to improve the thoroughness with which it reports data, thus increasing both its interpretability and the portion of research that can later be synthesized via meta-analysis. It is also in the best interest of the interactionist tradition (and, by extension, the field of SLA more generally) to insist that changes occur in the way research is conducted and reported. Among the most critical of changes are pretesting, increasing power (via sample size; see Hauser, 2001), and more thorough reporting of data, specifically (a) exact p-values to accompany the results of all statistical tests, (b) standard deviations along with all means, (c) confidence intervals, and (d) effect sizes. These relatively simple changes will advance the interactionist research agenda significantly at both the individual study level and in the aggregate. As we alluded to earlier and as we have seen in other fields, however, progress in these areas is not likely in the absence of external and preferably top-down pressures. There are historical, cultural, and institutional traditions that tend toward preserving the status quo.
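
To illustrate how sample size bears on power and on the precision of a reported effect, the sketch below approximates the power of a two-sided, two-group comparison and builds a large-sample confidence interval around d. It assumes equal group sizes for the power calculation, uses a normal approximation rather than the noncentral t distribution, and ignores the negligible probability of rejecting in the wrong direction. The function names and input values are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group mean comparison.

    Normal approximation: the test statistic is treated as normal
    with mean d * sqrt(n / 2) and unit variance.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    expected_z = abs(d) * math.sqrt(n_per_group / 2)
    return _STD_NORMAL.cdf(expected_z - z_crit)


def d_confidence_interval(d, n1, n2, level=0.95):
    """Large-sample confidence interval for a standardized mean difference."""
    variance = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    half_width = _STD_NORMAL.inv_cdf(0.5 + level / 2) * math.sqrt(variance)
    return d - half_width, d + half_width


# Hypothetical medium effect (d = 0.5) at two sample sizes per group.
for n in (20, 64):
    low, high = d_confidence_interval(0.5, n, n)
    print(n, round(approx_power(0.5, n), 2), round(low, 2), round(high, 2))
```

Under these assumptions, a study with 20 participants per group has well under a 50% chance of detecting a medium effect, and its interval around d spans zero, whereas roughly 64 per group brings power to about 80%.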

This article constitutes a preliminary step toward more fully understanding the use of research methods in SLA. However, it focused exclusively on quantitative studies carried out in the interactionist tradition. Future studies of quality in SLA research are needed, both with a broad, disciplinewide scope and within other subdomains of L2 research. Such research will improve the field by helping us to better understand findings from the past and improve future research practices, ultimately leading to greater efficiency and progress in SLA.

Notes
  • 1

    One could argue that not using books across the 30-year span could bias the results. Our decision not to use books was based on the proliferation of books dealing with this topic and a lack of an objective means to determine which books to use and which not to use. Journals gave us a principled way to create a finite list of articles. The two books we did use were selected because, in our view, they were the earliest collection and the most recent collection (at the time the research was conducted) of books solely devoted to interaction-based research.

  • 2

    As one might expect, there were a number of studies that did not contribute d values to this step in the analysis, either because insufficient data were reported to calculate an effect size or because of the categorical and/or correlational data and analyses. As noted in Table 6, sufficient data were reported to calculate a d value in 71 studies. However, effect sizes from only 57 studies were included in this phase of the analyses because (a) one study (Lai & Zhao, 2006) only reported sufficient data to calculate an effect size for a pre-post contrast (as opposed to a treatment-comparison contrast), which would not be appropriate to combine; (b) one study's effect sizes (Mackey, 1999) were not able to be calculated based on the published report but were found in Mackey and Goo (2007); (c) one study's effect sizes (Long et al., 1998) were based on means and standard deviations of gain scores, which, again, could not be combined appropriately with the rest of the effects; (d) in one study (Oliver, 1995), the means and standard deviations reported between groups were not based on a common measure and were therefore not appropriate for calculating an effect size; and (e) 12 studies included groups that could not be clearly identified as treatment or comparison conditions. Finally, although several other effect size indices were reported (e.g., η², r), these were not included in the analyses due to the predominance of means-based analyses in the interactionist literature.

References

  • References to studies included in the analysis appear as supporting information in Appendix S1 on the Language Learning Web site.
  • Blake, R. (2000). Computer mediated communication: A window on L2 Spanish interlanguage. Language Learning & Technology, 4, 120–136.
  • Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-analysis. Chichester, UK: Wiley.
  • Brown, R. (1991). Group work, task difference, and second language acquisition. Applied Linguistics, 12, 1–12.
  • Burnham, J. C. (1990). The evolution of editorial peer review. Journal of the American Medical Association, 263, 1323–1329.
  • Byrnes, H. (2008). [Review of the book Synthesizing research on language learning and teaching]. Modern Language Journal, 92, 48–50.
  • Carroll, S., & Swain, M. (1993). Explicit and implicit negative feedback: An empirical study of the learning of linguistic generalizations. Studies in Second Language Acquisition, 15, 357–386.
  • Chan, A.-W., Hróbjartsson, A., Haahr, M. T., Gøtzsche, P. C., & Altman, D. G. (2004). Empirical evidence for selective reporting of outcomes in randomized trials. Journal of the American Medical Association, 291, 2457–2465.
  • Chapelle, C. A., & Duff, P. A. (2003). Some guidelines for conducting quantitative and qualitative research in TESOL. TESOL Quarterly, 37, 157–178.
  • Chaudron, C. (2001). Progress in language classroom research: Evidence from The Modern Language Journal, 1916–2000. Modern Language Journal, 85, 57–76.
  • Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
  • Crookes, G. (1991). Power, effect size, and second language research. Another researcher comments. TESOL Quarterly, 25, 762–765.
  • Day, R. R., Chenoweth, A. A., Chun, A. E., & Luppescu, S. (1984). Corrective feedback in native-nonnative discourse. Language Learning, 34, 19–45.
  • DeKeyser, R., & Schoonen, R. (2007). Editors’ announcement. Language Learning, 57, ix–x.
  • Derwing, T. M. (1989). Information type and its relation to nonnative speaker comprehension. Language Learning, 39, 157–172.
  • Dörnyei, Z. (2007). Research methods in applied linguistics: Quantitative, qualitative and mixed methodologies. Oxford: Oxford University Press.
  • Doughty, C., & Pica, T. (1986). “Information gap” tasks: Do they facilitate second language acquisition? TESOL Quarterly, 20, 305–325.
  • Edge, J., & Richards, K. (1998). May I see your warrant, please? Justifying outcomes in qualitative research. Applied Linguistics, 19, 334–356.
  • Egbert, J. (2007). Quality analysis of journals in TESOL and applied linguistics. TESOL Quarterly, 41, 157–171.
  • Ellis, N. C. (2000). Editorial statement. Language Learning, 50, xi–xiii.
  • Ellis, R. (1985). Teacher-pupil interaction in second language development. In S. Gass & C. G. Madden (Eds.), Input in second language acquisition (pp. 69–85). Rowley, MA: Newbury House.
  • Ellis, R. (1999). Learning a second language through interaction. Philadelphia: Benjamins.
  • Felser, C. (2005). Experimental psycholinguistic approaches to second language acquisition. Second Language Research, 21, 95–97.
  • Fern, E. F., & Monroe, K. B. (1996). Effect-size estimates: Issues and problems in interpretation. Journal of Consumer Research, 23, 89–105.
  • Flahive, D., & Ehlers-Zavala, F. (2010, March). Power analysis in applied linguistics research. Paper presented at the Annual Conference of the American Association for Applied Linguistics, Atlanta, GA.
  • Fotos, S. S. (1993). Consciousness raising and noticing through focus on form: Grammar task performance versus formal instruction. Applied Linguistics, 14, 385–407.
  • Fotos, S., & Ellis, R. (1991). Communicating about grammar: A task-based approach. TESOL Quarterly, 25, 605–628.
  • Gaies, S. J. (1981). Learner feedback and its effects in communication tasks: A pilot study. Studies in Second Language Acquisition, 4, 46–59.
  • Gass, S. (1993). Editorial: Second language acquisition: Cross-disciplinary perspectives. Second Language Research, 9, 95–98.
  • Gass, S. (1997). Input, interaction, and the second language learner. Mahwah, NJ: Erlbaum.
  • Gass, S. (2004). Conversation analysis and input-interaction. Modern Language Journal, 88, 597–616.
  • Gass, S., & Lewis, K. (2007). Perceptions about interactional feedback: Differences between heritage language learners and non-heritage learners. In A. Mackey (Ed.), Conversational interaction in second language acquisition: A collection of empirical studies (pp. 79–99). Oxford: Oxford University Press.
  • Gass, S., & Mackey, A. (2000). Stimulated recall methodology in second language research. Mahwah, NJ: Erlbaum.
  • Gass, S., & Mackey, A. (2006). Input, interaction and output: An overview. In K. Bardovi-Harlig & Z. Dörnyei (Eds.), AILA Review (pp. 3–17). Amsterdam: Benjamins.
  • Gass, S., & Mackey, A. (2007). Input, interaction and output in second language acquisition. In J. Williams & B. VanPatten (Eds.), Theories in second language acquisition (pp. 175–299). Mahwah, NJ: Erlbaum.
  • Gass, S., Mackey, A., & Ross-Feldman, L. (2005). Task-based interactions in classroom and laboratory settings. Language Learning, 55, 575–611.
  • Gass, S., & Madden, C. G. (Eds.) (1985). Input in second language acquisition. Rowley, MA: Newbury House.
  • Gass, S., & Varonis, E. M. (1985a). Task variation and nonnative/nonnative negotiation of meaning. In S. Gass & C. G. Madden (Eds.), Input in second language acquisition (pp. 149–161). Rowley, MA: Newbury House.
  • Gass, S., & Varonis, E. M. (1985b). Variation in native speaker speech modification to non-native speakers. Studies in Second Language Acquisition, 7, 37–58.
  • Godfroid, A., Housen, A., & Boers, F. (2010, March). Attention and awareness in SLA: An attempt to refine the construct of “noticing.” Paper presented at the conference of the American Association for Applied Linguistics (AAAL), Atlanta, GA.
  • Hall, J. K. (2010). Interaction as method and result of language learning. Language Teaching, 43, 202–215.
  • Hatch, E., & Lazaraton, A. (1991). The research manual: Design and statistics for applied linguistics. Boston: Heinle & Heinle.
  • Hauser, E. (2001, October). The statistical power of second language acquisition research: A review. Paper presented at the Pacific Second Language Research Forum, University of Hawaii at Manoa.
  • Hawkins, B. (1985). Is an “appropriate response” always so appropriate? In S. Gass & C. G. Madden (Eds.), Input in second language acquisition (pp. 162–177). Rowley, MA: Newbury House.
  • Henning, G. (1986). Quantitative methods in language acquisition research. TESOL Quarterly, 20, 701–708.
  • Hirvonen, T. (1985). Children's foreigner talk: Peer talk in play context. In S. Gass & C. G. Madden (Eds.), Input in second language acquisition (pp. 137–148). Rowley, MA: Newbury House.
  • Hopewell, S., Clarke, M., & Mallet, S. (2005). Grey literature and systematic reviews. In H. R. Rothstein, A. J. Sutton, & M. Borenstein (Eds.), Publication bias in meta-analysis: Prevention, assessments, and adjustments (pp. 49–72). Chichester: Wiley.
  • Iwashita, N. (2003). Negative feedback and positive evidence in task-based interaction: Differential effects on L2 development. Studies in Second Language Acquisition, 25, 1–36.
  • Jefferson, T. O., Alderson, P., Davidoff, F., & Wager, E. (2003). Editorial peer-review for improving the quality of reports of biomedical studies (Cochrane Methodology Review). In The Cochrane Library, Issue 4. Chichester, UK: Wiley.
  • Journal Article Reporting Standards Working Group (2008). Reporting standards for research in psychology: Why do we need them? What might they be? American Psychologist, 63, 839–851.
  • Keck, C., Iberri-Shea, G., Tracy-Ventura, N., & Wa-Mbaleka, S. (2006). In J. M. Norris & L. Ortega (Eds.), Synthesizing research on language learning and teaching (pp. 91–131). Philadelphia: Benjamins.
  • Kelm, O. (1992). The use of synchronous computer networks in second language instruction: A preliminary report. Foreign Language Annals, 25, 441–454.
  • Kim, Y., & McDonough, K. (2008). The effect of interlocutor proficiency on the collaborative dialogue between Korean as a second language learners. Language Teaching Research, 12, 211–234.
  • Kleifgen, J. A. (1985). Skilled variation in a kindergarten teacher's use of foreigner talk. In S. Gass & C. G. Madden (Eds.), Input in second language acquisition (pp. 59–68). Rowley, MA: Newbury House.
  • Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.
  • Krashen, S. (1977). Some issues relating to the monitor model. In H. Brown, C. Yorio, & R. Crymes (Eds.), On TESOL ’77: Teaching and learning English as a second language: Trends in research and practice (pp. 144–158). Washington, DC: Teachers of English to Speakers of Other Languages.
  • Krashen, S. (1982). Principles and practice in second language acquisition. London: Pergamon.
  • Krashen, S. (1985). The input hypothesis: Issues and implications. New York: Longman.
  • Lai, C., & Zhao, Y. (2006). Noticing and text-based chat. Language Learning & Technology, 10, 102–120.
  • Larson-Hall, J. (2010). A guide to doing statistics in second language research using SPSS. New York: Routledge.
  • Lazaraton, A. (1991). Power, effect size, and second language research. A researcher comments. TESOL Quarterly, 25, 759–762.
  • Lazaraton, A. (2000). Current trends in research methodology and statistics in applied linguistics. TESOL Quarterly, 34, 175–181.
  • Lazaraton, A., Riggenbach, H., & Ediger, A. (1987). Forming a discipline: Applied linguists’ literacy in research methodology and statistics. TESOL Quarterly, 21, 263–277.
  • Leow, R. P. (2000). A study of the role of awareness in foreign language behavior. Studies in Second Language Acquisition, 22, 557–584.
  • Li, S. (2010). The effectiveness of corrective feedback in SLA: A meta-analysis. Language Learning, 60, 309–365.
  • Lin, Y.-H., & Hedgcock, J. (1996). Negative feedback incorporation among high-proficiency and low-proficiency Chinese-speaking learners of Spanish. Language Learning, 46, 567–611.
  • Lipsey, M. W. (2009). Identifying interesting variables and analysis opportunities. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis (2nd ed., pp. 147–158). New York: Russell Sage Foundation.
  • Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48, 1181–1209.
  • Loewen, S., & Gass, S. (2009). The use of statistics in L2 acquisition research. Language Teaching, 42, 181–196.
  • Loewen, S., & Philp, J. (2006). Recasts in the adult English L2 classroom: Characteristics, explicitness, and effectiveness. Modern Language Journal, 90, 536–556.
  • Long, M. H. (1980). Input, interaction and second language acquisition. Unpublished doctoral dissertation, University of California, Los Angeles.
  • Long, M. H. (1983a). Linguistic and conversational adjustments to non-native speakers. Studies in Second Language Acquisition, 5, 177–193.
  • Long, M. H. (1983b). Native speaker/non-native speaker conversation and the negotiation of comprehensible input. Applied Linguistics, 4, 126–141.
  • Long, M. H. (1996). The role of the linguistic environment in second language acquisition. In W. C. Ritchie & T. K. Bhatia (Eds.), Handbook of research on language acquisition: Vol. 2. Second language acquisition (pp. 413–468). San Diego, CA: Academic Press.
  • Long, M. H., Inagaki, S., & Ortega, L. (1998). The role of implicit negative feedback in SLA: Models and recasts in Japanese and Spanish. Modern Language Journal, 82, 357–371.
  • Loschky, L., & Bley-Vroman, R. (1993). Grammar and task-based methodology. In G. Crookes & S. Gass (Eds.), Tasks and language learning: Integrating theory and practice (pp. 123–167). Clevedon, UK: Multilingual Matters.
  • Lyster, R., & Mori, H. (2006). Interactional feedback and instructional counterbalance. Studies in Second Language Acquisition, 28, 269–300.
  • Lyster, R., & Saito, Y. (2010). Oral feedback in classroom SLA: A meta-analysis. Studies in Second Language Acquisition, 32, 265–302.
  • Mackey, A. (1999). Input, interaction, and second language development: An empirical study of question formation in ESL. Studies in Second Language Acquisition, 21, 557–587.
  • Mackey, A. (2006). Feedback, noticing and instructed second language learning. Applied Linguistics, 27, 405–430.
  • Mackey, A. (Ed.) (2007a). Conversational interaction in second language acquisition: A collection of empirical studies. Oxford: Oxford University Press.
  • Mackey, A. (2007b). Introduction: The role of conversational interaction in second language acquisition. In A. Mackey (Ed.), Conversational interaction in second language acquisition: A collection of empirical studies (pp. 1–26). Oxford: Oxford University Press.
  • Mackey, A., & Gass, S. (2005). Second language research: Methodology and design. Mahwah, NJ: Erlbaum.
  • Mackey, A., & Gass, S. (2006). Introduction. Studies in Second Language Acquisition, 28, 169–178.
  • Mackey, A., & Gass, S. (in press). Research methodologies in second language acquisition. London: Basil Blackwell.
  • Mackey, A., & Goo, J. (2007). Interaction research in SLA: A meta-analysis and research synthesis. In A. Mackey (Ed.), Conversational interaction in second language acquisition: A collection of empirical studies (pp. 407–449). Oxford: Oxford University Press.
  • Mackey, A., & Philp, J. (1998). Conversational interaction and second language development: Recasts, responses, and red herrings? Modern Language Journal, 82, 338–356.
  • Mackey, A., & Polio, C. (Eds.). (2009). Multiple perspectives on interaction: Second language research in honor of Susan M. Gass. New York: Routledge.
  • Magnan, S. S. (1994). From the editor: The MLJ tradition and the challenges ahead. Modern Language Journal, 78, 7–9.
  • Magnan, S. S. (2007). Commentary: The promise of digital scholarship in SLA research and language pedagogy. Language Learning & Technology, 11, 152–155.
  • Muranoi, H. (2000). Focus on form through interaction enhancement: Integrating formal instruction into a communicative task in EFL classrooms. Language Learning, 50, 617–673.
  • Murphy, K. R., & Myors, B. (2004). Statistical power analysis. Mahwah, NJ: Erlbaum.
  • Nagata, N. (1997). An experimental comparison of deductive and inductive feedback generated by a simple parser. System, 25, 515–534.
  • Nekrasova, T., & Becker, T. (2009). Effectiveness of practice: A research synthesis and quantitative meta-analysis. Manuscript in preparation.
  • Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis. Language Learning, 50, 417–528.
  • Norris, J. M., & Ortega, L. (2003). Defining and measuring SLA. In C. J. Doughty & M. H. Long (Eds.), The handbook of second language acquisition (pp. 717–761). Malden, MA: Blackwell.
  • Norris, J. M., & Ortega, L. (2006). The value and practice of research synthesis for language learning and teaching. In J. M. Norris & L. Ortega (Eds.), Synthesizing research on language learning and teaching (pp. 3–50). Philadelphia: Benjamins.
  • Ohta, A. S. (2000). Re-thinking interaction in SLA: Developmentally appropriate assistance in the zone of proximal development and the acquisition of L2 grammar. In J. P. Lantolf (Ed.), Sociocultural theory and second language learning (pp. 51–78). Oxford: Oxford University Press.
  • Oliver, R. (1995). Negative feedback in child NS-NNS conversation. Studies in Second Language Acquisition, 17, 459–481.
  • Ortega, L. (2005). Methodology, epistemology, and ethics in instructed SLA research: An introduction. Modern Language Journal, 89, 317–327.
  • Oswald, F. L., & Plonsky, L. (2010). Meta-analysis in second language research: Choices and challenges. Annual Review of Applied Linguistics, 30, 85–110.
  • Philp, J. (2003). Constraints on “noticing the gap”: Nonnative speakers’ noticing of recasts in NS-NNS interaction. Studies in Second Language Acquisition, 25, 99–126.
  • Pica, T. (1987). Second language acquisition, social interaction, and the classroom. Applied Linguistics, 8, 3–21.
  • Pica, T., & Doughty, C. (1985a). Input and interaction in the communicative language classroom: A comparison of teacher-fronted and group activities. In S. Gass & C. G. Madden (Eds.), Input in second language acquisition (pp. 115–132). Rowley, MA: Newbury House.
  • Pica, T., & Doughty, C. (1985b). The role of group work in classroom second language acquisition. Studies in Second Language Acquisition, 7, 233–248.
  • Pica, T., Holliday, L., Lewis, N., & Morgenthaler, L. (1989). Comprehensible output as an outcome of linguistic demands on the learner. Studies in Second Language Acquisition, 11, 63–90.
  • Plonsky, L. (in press). The effectiveness of the second language strategy instruction: A meta-analysis. Language Learning, 61.
  • Plonsky, L. (forthcoming). Study quality: Assessing quantitative research methods and reporting practices in SLA. Unpublished doctoral dissertation, Michigan State University.
  • Plonsky, L., & Oswald, F. L. (2010, March). Interpreting mean differences in L2 research: Cohen's d-value, revisited. Paper presented at the conference of the American Association for Applied Linguistics, Atlanta, GA.
  • Plonsky, L., & Oswald, F. L. (in press). How to do a meta-analysis. In A. Mackey & S. Gass (Eds.), A guide to research methods in second language acquisition. London: Basil Blackwell.
  • Polio, C., Gass, S., & Chapin, L. (2006). Using stimulated recall to investigate native speaker perceptions in native-nonnative speaker interaction. Studies in Second Language Acquisition, 28, 237–267.
  • Prentice, D. A., & Miller, D. T. (1992). When small effects are impressive. Psychological Bulletin, 112, 160–164.
  • Publication manual of the American Psychological Association (6th ed.). (2010). Washington, DC: American Psychological Association.
  • Read, J. (2007). Towards a new collaboration: Research in SLA and language testing. New Zealand Studies in Applied Linguistics, 13, 22–35.
  • Russell, J., & Spada, N. (2006). The effectiveness of corrective feedback for the acquisition of L2 grammar: A meta-analysis of the research. In J. M. Norris & L. Ortega (Eds.), Synthesizing research on language learning and teaching (pp. 133–164). Philadelphia: Benjamins.
  • Sato, C. J. (1986). Conversation and interlanguage development: Rethinking the connection. In R. R. Day (Ed.), Talking to learn: Conversation in second language (pp. 23–45). Rowley, MA: Newbury House.
  • Scarcella, R. C., & Higa, C. (1981). Input, negotiation, and differences in second language acquisition. Language Learning, 31, 409–437.
  • Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training researchers. Psychological Methods, 1, 115–129.
  • Schmidt, R. (2001). Attention. In P. Robinson (Ed.), Cognition and second language instruction (pp. 3–32). New York: Cambridge University Press.
  • Sheen, Y. (2006). Exploring the relationship between characteristics of recasts and learner uptake. Language Teaching Research, 10, 361–392.
  • Sheen, Y. (2007). The effects of corrective feedback, language aptitude, and learner attributes on the acquisition of English articles. In A. Mackey (Ed.), Conversational interaction in second language acquisition: A collection of empirical studies (pp. 301–322). Oxford: Oxford University Press.
  • Shohamy, E. (2000). The relationship between language testing and second language acquisition, revisited. System, 28, 541–553.
  • Simard, D., & Wong, W. (2001). Alertness, orientation and detection. Studies in Second Language Acquisition, 23, 103–124.
  • Smith, B., & Lafford, B. A. (2009). The evaluation of scholarly activity in computer-assisted language learning. Modern Language Journal, 93, 868–883.
  • Swain, M. (1985). Communicative competence: Some roles of comprehensible input and comprehensible output in its development. In S. Gass & C. G. Madden (Eds.), Input in second language acquisition (pp. 235–253). Rowley, MA: Newbury House.
  • Takimoto, M. (2006). The effects of explicit feedback and form-meaning processing on the development of pragmatic proficiency in consciousness-raising tasks. System, 34, 601–614.
  • Tomasello, M., & Herron, C. (1989). Feedback for language transfer errors: The garden path technique. Studies in Second Language Acquisition, 11, 385–395.
  • Trikalinos, T. A., et al. (2004). Effect sizes in cumulative meta-analyses of mental health randomized trials evolved over time. Journal of Clinical Epidemiology, 57, 1124–1130.
  • Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105–110.
  • Vacha-Haase, T., & Thompson, B. (2004). How to estimate and interpret various effect sizes. Journal of Counseling Psychology, 51, 473–481.
  • Valentine, J. C., & Cooper, H. (2008). A systematic and transparent approach for assessing the methodological quality of intervention effectiveness research: The study design and implementation assessment device (Study DIAD). Psychological Methods, 13, 130–149.
  • VanPatten, B., & Williams, J. (2002). Research criteria for tenure in second language acquisition: Results from a survey of the field. Unpublished manuscript, The University of Illinois at Chicago.
  • Wa-Mbaleka, S. (2006). A meta-analysis investigating the effects of reading on second language vocabulary learning. Unpublished doctoral dissertation, Northern Arizona University, Flagstaff, AZ.
  • Wells, K., & Littell, J. H. (2009). Study quality assessment in systematic reviews of research on intervention effects. Research on Social Work Practice, 19, 52–62.
  • Wilkinson, L., & Task Force on Statistical Inference (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
  • Williams, J. (1999). Learner-generated attention to form. Language Learning, 49, 583–625.
  • Wilson, D. B. (2009). Systematic coding. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis (2nd ed., pp. 159–176). New York: Russell Sage Foundation.
  • Wilson, D. B., & Lipsey, M. W. (2001). The role of method in treatment effectiveness research: Evidence from meta-analysis. Psychological Methods, 6, 413442.

Supporting Information

Appendix S1. Studies Analyzed.

Filename                      Format   Size   Description
LANG_640_sm_AppendixS1.doc             74K    Supporting info item
