Evaluating the reading and listening outcomes of beginning-level Duolingo courses

The Challenge
As more and more learners use digital apps to learn languages, it is important for the field of language learning to understand the effectiveness of these apps. This article presents the listening and reading proficiency of Duolingo learners when they reach the end of its beginning-level Spanish and French courses.

proficiency tests. Further limiting the generalizability of findings in this area, many studies have focused on specific technological tools such as an audio-graphic conferencing system (Hampel & Hauck, 2004; Hampel, 2003), Wimba Voice (Gleason & Suvorov, 2011), and voice blogs (Sun, 2012).
Among the various models of online language learning, this study focused on online courses offered by a commercial provider via language learning apps. As background to the present study, we review a number of other efficacy studies based on Duolingo and other commercially available apps. We also review studies on the proficiency outcomes of university language programs, which serve as a source of comparison for the language gains made by the participants in the present study.

| Effectiveness of commercial online language learning products
Due to the commercial nature of their products, companies sometimes hire researchers to carry out commissioned research. There is a noteworthy set of commissioned studies by Vesselinov and Grego across five online language learning products: Rosetta Stone (Vesselinov, 2009), Duolingo (Vesselinov & Grego, 2012), Babbel (Vesselinov & Grego, 2016a), Busuu (Vesselinov & Grego, 2016b), and Italki (Vesselinov & Grego, 2018). These research reports were published on company websites as white papers, not as peer-reviewed journal articles. In these studies, the researchers followed a pretest-posttest design and investigated the effectiveness of each company's Spanish learning product. The participants were non-Hispanic learners between the ages of 19 and 69 with below-advanced Spanish proficiency. All five studies used the Web-based Computer Adaptive Placement Exam (WebCAPE, an adaptive exam that assesses vocabulary, reading, and grammar) as the primary data collection instrument and reported teaching effectiveness based on the points gained from pretest to posttest and the points gained per hour of study. Three studies (Vesselinov & Grego, 2016b; Vesselinov, 2009) also used the ACTFL Oral Proficiency Interview-computer version (OPIc) to assess participants' development in speaking ability. Overall, learners showed gains in WebCAPE points, and some percentage of learners leveled up in ACTFL OPIc ratings. Because pretest WebCAPE scores and OPIc levels differed across studies, it is hard to compare the effectiveness of these products based only on gained points, gained points per hour of study, or percentage of learners leveling up. For the findings to be meaningful, this set of studies would have benefited from more rigorous research designs, for example, control of prior proficiency, control of time on task, use of comparison groups, and use of more interpretable proficiency tests.
In two recent studies, Loewen and colleagues have also investigated the efficacy of online language learning products (Loewen et al., 2019, 2020). In a collaboration between two academics and a Babbel internal researcher, Loewen et al. (2020) examined the effectiveness of Babbel for learning Spanish. The study involved 54 participants who used Babbel to study Spanish for a minimum of 15 min/day during a period of approximately three months. The participants were college graduate and undergraduate students with an average age of 24 years who had taken an average of two classroom-based Spanish courses before the study. The study followed a pretest and posttest design based on measures of ACTFL OPIc, grammar, and vocabulary. The researchers found that after an average of approximately 12 h of learning on Babbel within 12 weeks, learners increased their oral proficiency by 0.7 ACTFL sublevels and made significant gains on grammar and vocabulary. The learning gains were associated with the duration of time participants spent on Babbel and their overall level of interest in learning Spanish.
Loewen et al. (2019) is a case study on learning beginner-level Turkish with Duolingo. Unlike in Loewen et al. (2020), the researchers in this study served as participants themselves. The researcher-participants were a professor and eight graduate students who were experienced language learners as well as researchers in language learning. They carried out the project to fulfill an obligatory class requirement. These true beginners of Turkish used Duolingo at least 1 h/week for 12 weeks. They were assessed with a summative achievement test that was used for a first-semester university-level Turkish class (Turkish 151 at the institution where the research was conducted). After an average of 29 h of learning Turkish on Duolingo, only one participant reached 70% of mastery on the Turkish 151 test. However, it is unclear whether the Turkish 151 test, designed for a particular university class, was appropriate as an outcome measure in the study. As an achievement test, it might have strong content validity for the Turkish 151 class because it tested what had been taught in that class, but it would not necessarily be appropriate for assessing learning on Duolingo or any other program of instruction.
In contrast with the single-sample studies described thus far, some online language learning products have been compared with traditional classroom instruction and no evidence of disadvantage has been identified. Lord (2015, 2016) investigated the effectiveness of Rosetta Stone with data from 12 true beginners during a 16-week academic semester. The participants of the study were enrolled in a university beginning-level Spanish course. They were divided into three groups: a control group, a Rosetta Stone group, and a group that used Rosetta Stone materials as a course text in class, with four learners in each group. Two assessments were used at the end of the semester: the vocabulary and grammar portion of the Spanish College Level Examination Program (CLEP) test and the Versant Automated Oral Proficiency Test in Spanish. No significant differences were observed between the three groups on either measure, even though qualitative differences were noticed in the interview scripts favoring the control group. In addition to concerns related to the study's small sample, a substantial difference between groups was observed for time-on-task, with the control group averaging 109 h of learning and the Rosetta Stone group averaging only 48 h of learning.
In another recent study, Rachels and Rockinson-Szapkiw (2018) compared online language learning products with traditional classroom instruction. The authors employed a pretest-posttest design to compare face-to-face Spanish classroom instruction with Duolingo's Spanish course for English speakers in an elementary school. The participants of the study were 164 students from 11 third- and fourth-grade classes. Students from six classes used Duolingo to learn Spanish while the other five classes attended regular face-to-face Spanish classes. Both groups learned Spanish for 40 min/week for 12 weeks. Students were assessed on Spanish vocabulary and grammar with multiple-choice items. The same test was used as pretest and posttest. The researchers found no significant difference between the two groups and concluded that Duolingo was a useful tool for teaching Spanish to elementary students.
Several studies in this small set provide some evidence of the effectiveness of online language learning products, indicating improvements in linguistic knowledge and no disadvantage compared to face-to-face learning. However, a few issues are noteworthy. First, there is a lack of involvement of independent researchers (see Lord, 2015, 2016, for a notable exception). The studies across different products were limited in the variety of authorship. For example, Vesselinov (and Grego) carried out commissioned studies on Rosetta Stone, Duolingo, Babbel, Busuu, and Italki. Loewen and colleagues have conducted studies on Babbel and Duolingo. The lack of research by academic scholars on commercial language learning products has been observed by several researchers (e.g., Heift & Chapelle, 2012; Plonsky & Ziegler, 2016; Smith, 2017), who called for more participation of language learning researchers and educators in exploring the effectiveness of commercial products. Loewen et al. (2020) attributed the lack of scholarly interest to a number of reasons, including researchers' limited control when utilizing apps and their deterrence by the commercial nature of the apps. These reasons seem highly relevant and worthy of concern for the potential threat they present to the internal validity of this line of research. As the language learning field calls for rigorous research into the efficacy of commercial products, one way to address these concerns is to allow collaboration between external scholars and internal researchers, as in the study by Loewen et al. (2020), where university researchers and an internal Babbel researcher collaborated and co-authored the paper. The team behind the present study likewise includes both industry- and university-based researchers. Even more trustworthy evidence might come from researchers who are completely independent of commercial entities.
Second, the outcome measures used in the studies were, in many cases, less than ideal. For example, as described above, Loewen et al. (2019) used a summative achievement test for a university class to assess learning on Duolingo; Vesselinov (and Grego) used a placement exam (WebCAPE) in all five studies they conducted. In the case of Vesselinov and Grego (2012, 2016a, 2016b), the researchers defined product efficacy as a gain of WebCAPE points per hour of study and provided estimates of the number of hours of study needed to place out of the first-semester university language course. Such findings are not only hard to interpret outside the immediate context but can also be seen as making unwarranted claims. As a result, some scholars have expressed skepticism about some of the claims commercial language learning products have made about learner success, calling for more rigorous, research-based proficiency assessments (Tarone, 2015; van Deusen-Scholl, 2015).

| Proficiency outcomes of university language programs
As hubs of language learning, university-based language courses provide one possible source of comparison for the effectiveness of commercial online language learning products. Both settings have sought in recent years to move toward proficiency-based instruction and outcomes (e.g., Cox et al., 2018). In line with this movement, the Language Flagship Proficiency Initiative, supported by a grant from the National Security Education Program (Winke et al., 2014-2017), has funded the administration of proficiency assessments for language learners at the University of Utah, the University of Minnesota (Twin Cities), and Michigan State University. Students at varying semesters of undergraduate study (second to eighth semester) were assessed with the ACTFL Listening Proficiency Test (LPT), Reading Proficiency Test (RPT), and OPIc in 10 different languages, yielding over 20,000 scores. Several publications related to the foreign language proficiency test data (Winke et al., 2014-2017) provided by the Flagship Proficiency Initiative are available to professionals in language education (e.g., Rubio & Hacking, 2019; Tschirner, 2016). Considering the scope of the current study, the following review focuses on studies that reported the listening and reading proficiency levels of university students in Spanish and French.
Tschirner (2016) reported listening and reading proficiency levels at different milestones of undergraduate study based on data from more than 3000 participants learning seven languages at 21 institutions across the United States, although the majority of the test scores were from the foreign language proficiency test data (Winke et al., 2014-2017). More concretely, ACTFL LPTs and RPTs were administered to first-, second-, third-, and fourth-year students from 2014 to 2015. Data were collected from learners of French, German, Japanese, Italian, Portuguese, Russian, and Spanish. The main findings were reported based on listening and reading proficiency levels in Spanish and French, which made up 82% of all tests completed. In both languages, there was a steady increase in proficiency levels over the semesters in both listening and reading, but listening proficiency levels were substantially lower than reading levels. By the end of the fourth semester, on average, students reached Intermediate Low (IL) in reading proficiency, but their listening proficiency was Novice Mid (NM), approaching Novice High (NH). Notably, the findings of Rubio and Hacking (2019), who reported on all three institutions of the Flagship Proficiency Initiative, were very consistent with those of Tschirner (2016): among other results, after four semesters of instruction, reading reached IL in Spanish and French, but listening remained at NH.
Soneson and Tarone (2019) reported data from the Proficiency Assessment for Curricular Enhancement (PACE) project on ACTFL assessments of speaking, listening, and reading in seven languages at the University of Minnesota. Their findings reveal somewhat more rapid gains compared to Tschirner (2016) and Rubio and Hacking (2019). After two semesters of instruction, students in Spanish and French reached IL in reading, NH in listening, and IL in speaking. After four semesters of instruction, students reached Intermediate Mid (IM) in reading, IL in listening, and IM in speaking. The discrepancy between Soneson and Tarone (2019), on the one hand, and the findings of other studies in the Flagship Proficiency Initiative, on the other, might be due to differences in instruction/exposure. According to Strawbridge et al. (2019), the PACE project was based on an enhanced curriculum that required five credit hours per semester, which was very likely more than language programs in other institutions that offer three or four credit hours per semester.
Similar to the studies described in this section thus far, Strawbridge et al. (2019) sought to track learners' speaking, listening, and reading proficiency ratings in French and Spanish in postsecondary programs. The researchers found that second- and fourth-semester students of both languages scored significantly lower in listening than in reading and speaking. At the end of the fourth semester, students in both languages reached IM in reading, IL in listening, and IM in speaking. However, as mentioned in relation to Soneson and Tarone (2019), the language programs under investigation offered five credit hours per semester for the first four semesters of language study.
Finally, Winke et al. (2020) provided a proficiency profile of university undergraduate students in six languages (Arabic, Chinese, French, Portuguese, Russian, and Spanish) based on ACTFL speaking, listening, and reading proficiency data collected in Spring 2017 as part of the Language Flagship Proficiency Initiative. Their findings largely resemble those of the other studies reviewed thus far: among students who are non-heritage speakers, fourth-semester French students achieved IL in both reading and listening; fourth-semester Spanish students reached IL in reading and NH in listening.
In sum, among the five studies reviewed in this section, Tschirner (2016), Rubio and Hacking (2019), and Winke et al. (2020) demonstrated that students at the end of the fourth semester of Spanish and French courses reached IL in reading proficiency and between NM and IL in listening proficiency, while Soneson and Tarone (2019) and Strawbridge et al. (2019) reported results one ACTFL sublevel higher. The higher proficiency reported in these two studies appears to reflect a single language program that offers an enhanced curriculum. Across studies, four semesters of study ranged from two to five credit hours per semester, for a total of eight to 20 credit hours. These findings are summarized in Table 1.
Overall, the body of literature reviewed here can be summarized as follows. First, there is fairly strong evidence that technology-based instruction can be effective in fostering second language development. Such gains are especially robust when training occurs in the context of larger educational programs such as those offered by tertiary institutions. Evidence of the effectiveness of web-based language-learning apps also appears to be accumulating. However, as noted above, this line of investigation is somewhat limited not only by the number of investigations available to date but also by certain study design features such as the choice of outcome measures, sample sizes, and the lack of collaboration between independent (i.e., university-affiliated) and industry-based researchers. Finally, large-scale studies of university-based language learning provide a fairly clear picture of the range of proficiency-based outcomes at different levels of instruction. Considering the overlapping interests across these domains, the primary interest of the present study is to compare such outcomes with those of the users of commercially available apps in order to assess, using a standardized assessment, the effectiveness and efficiency of the latter.

| The current study
The current study aimed to shed light on the question of what proficiency outcomes Duolingo learners can expect to achieve. We did so by measuring the listening and reading proficiency levels of Duolingo learners who had completed the beginning-level material in the Spanish and French courses. (Follow-up studies will investigate the effectiveness of Duolingo for other skills such as speaking and writing.) To better understand the proficiency levels learners reached and the means by which they got there, user activity data such as time spent on learning were also analyzed. Finally, learners' proficiency levels as measured by standardized test scores were compared with the proficiency outcomes of students enrolled in US-based university language courses.

| Duolingo course structure
The beginning-level content of a Duolingo course includes five sections, each of which concludes with a "checkpoint" (see Figure 1, left). Sections consist of "skills," which are sets of lessons on a functionally coherent topic (e.g., Travel or School). There are a total of 114 skills in the beginning level of the Spanish course and 99 skills in the beginning level of the French course, as shown in Table 2. Each skill includes five difficulty levels with four to five lessons at each level, where the higher difficulty is achieved through exercises requiring progressively more recall and production. For example, the sentence-building exercise in Figure 1 (middle) is relatively easy compared to a similar exercise without a word bank. Learners are required to complete at least one difficulty level in each row to move on to the next row.
Duolingo uses a comprehension-based approach to foster long-term retention and to promote communication in the new language (for a review of evidence on the effects of comprehension-based instruction, see Shintani et al., 2013). The courses expose learners to vocabulary and grammar in sentences in the target language such that learners will gradually infer linguistic regularities from repeated exposure to and engagement with meaningful input. Furthermore, Duolingo lessons complement more implicit, comprehension-based learning with explicit feedback and explanations. For some structures, explicit explanation can offer a shortcut to more efficient learning. This is especially the case for features of the target language that may be difficult to notice from input alone (DeKeyser, 2003; Ellis, 2015). Duolingo courses also include longer-form, discourse-level content in the form of interactive story exercises (see Figure 1, right), which provide learners with opportunities to practice listening and reading skills. These exercises provide a real-world context for language use, demonstrate how language is organized beyond the sentence level, and feature more interactive and social aspects of the target language. Lessons of all types involve many opportunities for practice and repeated exposure to target language structures.
Duolingo courses are aligned to the Common European Framework of Reference (CEFR), an international standard for language proficiency (Council of Europe, 2001). The CEFR guides curricular development by focusing on communicative functions, that is, what learners actually are able to do with a language, such as asking for directions or ordering a cup of coffee.
FIGURE 1 Example Duolingo course structure (left), example sentence-building exercise type (middle), and example story (right)

2.4.2 | Research questions
As reviewed above, separate sets of studies have investigated the effectiveness of commercial online language learning products and university language programs. However, no direct comparisons have been made between the proficiency outcomes of these two educational environments, as the current study aims to do. The study also aimed to address some of the issues identified in the review of previous research on online language learning, such as collaboration between external academic scholars and internal researchers and the use of established proficiency measures. In particular, the current study investigated the following research questions:
1. What levels of reading and listening proficiency did Duolingo learners achieve upon reaching the end of the beginning-level Spanish and French courses? (RQ1)
2. What were the properties of learners' in-app activity, in terms of time spent studying, leveling up, and specific Duolingo features used, before reaching the end of the beginning-level course? (RQ2)
3. How did Duolingo learners' reading and listening proficiency scores compare with proficiency outcomes of US-based university students in Spanish and French courses based on ACTFL reading and listening proficiency tests? (RQ3)

3 | METHODS

| Participants
The participants of the current study were 135 Spanish learners and 90 French learners using the Duolingo product. They were learners who (1) were at least 18 years old; (2) had an IP address in the United States; (3) had self-reported no or little prior proficiency in the target language; (4) had reached the end of Section 5 of the course; (5) reported using Duolingo as the only tool to learn the target language; and (6) had proper computer equipment for online testing (see further information on the recruitment procedures below). A combination of program-recorded data and responses to background survey questions was used to select participants who met all these criteria and who voluntarily and independently chose to use Duolingo to learn French or Spanish. With regard to self-reported no or little prior proficiency in the target language, only learners who reported prior proficiency of 0-2 on a 0-10 scale were included, with 0 meaning "I have no knowledge of the language at all" and 10 indicating "I have perfect knowledge of the language." Note that Duolingo collects this information from all learners when they reach the first checkpoint for the purposes of learner analytics and not for course placement.
Demographic and other background information was also collected through the survey. Among the 210 participants who reported their age, ages ranged from 18 to 83, with a mean of 43.99 (SD = 15.54). In terms of gender, 49% of the participants identified themselves as male and 48% as female. Seventy-eight percent of the participants listed their ethnicity as Caucasian, 13% as Asian, and 3% as African American. Thirty-nine percent of the participants reported having a bachelor's degree as their highest level of education, 37% a master's degree, and 14% a doctoral degree. Finally, 74% of the participants reported speaking only English before age 6; 8% were early bilingual speakers of English and another language; and 18% of the participants did not speak English before age 6 (their first languages varied widely and none of them were heritage speakers of the target language). For a more detailed by-course description of participant background information, see Appendix A.
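The six inclusion criteria above amount to a simple filter over learner records. The following Python sketch illustrates that logic only; the `Learner` record and its field names are hypothetical and do not reflect Duolingo's actual data schema or the authors' selection code.

```python
from dataclasses import dataclass


@dataclass
class Learner:
    # Hypothetical record; field names are illustrative, not Duolingo's schema.
    age: int
    has_us_ip_address: bool
    prior_proficiency: int  # self-report on the 0-10 scale
    reached_end_of_section_5: bool
    used_only_duolingo: bool
    has_test_equipment: bool


def is_eligible(learner: Learner) -> bool:
    """Apply the six inclusion criteria listed in the text."""
    return (
        learner.age >= 18
        and learner.has_us_ip_address
        and 0 <= learner.prior_proficiency <= 2
        and learner.reached_end_of_section_5
        and learner.used_only_duolingo
        and learner.has_test_equipment
    )


# Made-up candidates for illustration only.
candidates = [
    Learner(34, True, 1, True, True, True),
    Learner(17, True, 0, True, True, True),   # under 18
    Learner(52, True, 5, True, True, True),   # prior proficiency above 2
    Learner(41, True, 2, True, False, True),  # also used another tool
]
eligible = [c for c in candidates if is_eligible(c)]
print(len(eligible))
```

Keeping each criterion as a separate field makes it straightforward to report how many candidates were excluded at each step of the recruitment funnel.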

| ACTFL LPT and RPT
The ACTFL LPT and RPT were used as the main data collection instruments. The ACTFL LPT and RPT are standardized tests for the global assessment of reading and listening ability (ACTFL, 2013, 2014). They measure how well test-takers spontaneously comprehend the texts and discourse they read or listen to as described in the ACTFL 2012 Proficiency Guidelines. ACTFL has 10 levels in its proficiency rating scale, from low to high in the order of Novice (low, mid, high), Intermediate (low, mid, high), Advanced (low, mid, high), and Superior. For the purpose of this project, Form E of the tests was used, which targets proficiency levels between Novice Low and Advanced Low. The tests, paid for by Duolingo, were administered to each participant online by a remote human proctor employed by ACTFL/Language Testing International. The participant was asked to read or listen to 15 passages and answer three multiple-choice questions after each passage. Each test was given an ACTFL rating immediately after the test was submitted. ACTFL ratings were coded numerically by following the 1-10 point scale as in previous studies (e.g., Rubio & Hacking, 2019; Tschirner, 2016; Winke et al., 2020). See Table 3 for the mapping between the point scale and each proficiency sublevel.
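The numerical coding can be illustrated with a short Python sketch. Table 3 is not reproduced in this section, so the mapping below is an assumption derived from the ordering of the 10 levels given above; it is consistent with the values IL = 4 and IM = 5 cited in the results. The abbreviations NL, IH, AL, AM, AH, and S are assumed by analogy with NM, NH, IL, and IM, and the sample ratings are made up.

```python
# Assumed ordering of ACTFL sublevels from low to high (see lead-in).
ACTFL_SCALE = ["NL", "NM", "NH", "IL", "IM", "IH", "AL", "AM", "AH", "S"]

# Map each rating acronym to its position on the 1-10 point scale.
RATING_TO_SCORE = {rating: i for i, rating in enumerate(ACTFL_SCALE, start=1)}


def mean_score(ratings):
    """Average the numeric codes of a list of ACTFL rating acronyms."""
    scores = [RATING_TO_SCORE[r] for r in ratings]
    return sum(scores) / len(scores)


# Hypothetical ratings for illustration only.
sample_ratings = ["NH", "IL", "IL", "IM"]
print(mean_score(sample_ratings))
```

Because the scale is ordinal, a mean of the numeric codes is a summary convenience rather than a measurement on an interval scale.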

| Background survey
The questionnaire included sets of questions related to language background, demographic information, self-assessment of proficiency development, feedback about the Duolingo product, and a set of questions for participant selection mentioned earlier. The questionnaire can be found in Appendix B.

| Data collection procedures
Data collection took place during May-July 2020. Learners with an IP address in the United States and a prior proficiency of 0-2 were contacted by e-mail when they reached the end of Section 5 of the Spanish or French Duolingo course. In the e-mail, they were invited to participate in a research study and were encouraged to submit the background survey. They were selected to participate in the study if their responses indicated that (1) they did not take classes or use other programs/apps to learn the target language during the period of learning on Duolingo and (2) they had access to proper equipment for taking the test.
Participants completed one ACTFL proficiency test at a time, with the order of tests (reading and listening) randomized across participants. Each time a test was ordered, the participant received an e-mail from Language Testing International (LTI) with their test ID and instructions about how to schedule a time for the test. After they finished the first test, the second test was ordered for them, and they were again contacted by LTI to schedule and take it through the same process. Each participant received $100 from Duolingo after completing both tests. Table 4 shows the funnel for data collection.
A few participants did not take both tests. Among a total of 135 Spanish-learner participants, 132 reading and 131 listening scores were collected. Among a total of 90 French-learner participants, 88 reading and 89 listening scores were collected.

| Analyses
Descriptive statistics were calculated to answer the first and second research questions on the proficiency outcomes of Duolingo learners and their in-app activity until reaching the end of the beginning-level content. For the third research question on the comparison of proficiency outcomes between university students and Duolingo learners, t tests were carried out for each language skill with the R statistical package (R Core Team, 2020).
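As an illustration of the kind of comparison involved in the third research question, the Python sketch below computes a two-sample t statistic from summary statistics (mean, SD, n). The text does not specify which variant of the t test was run in R, so the use of Welch's unequal-variance form here is an assumption, and the numbers are invented rather than taken from the study's tables.

```python
from math import sqrt


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, df) for Welch's unequal-variance t test from summary statistics."""
    v1 = sd1 ** 2 / n1  # squared standard error of group 1's mean
    v2 = sd2 ** 2 / n2  # squared standard error of group 2's mean
    t = (mean1 - mean2) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical summary statistics on the 1-10 ACTFL point scale.
t_stat, dof = welch_t(mean1=5.0, sd1=1.0, n1=100, mean2=4.0, sd2=1.0, n2=100)
print(round(t_stat, 3), round(dof, 1))
```

Working from summary statistics is convenient when only published means and standard deviations are available for the comparison group; obtaining a p value additionally requires the t distribution, which a statistics library would supply.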

| Proficiency outcomes of Duolingo Learners
The reading and listening proficiency ratings of Duolingo learners who participated in the current study are presented in Figure 2. The ratings in Spanish reading, French reading, and French listening were normally distributed; however, the ratings in Spanish listening were positively skewed. Two-thirds of the Spanish listening proficiency ratings were at the Novice level.
On the basis of the numerical coding of the proficiency ratings on a 1-10 point scale presented in Table 3 above, Table 5 presents the summary data with mean scores and standard deviations. Overall, Spanish and French reading scores were between IL (4) and IM (5), while listening scores were at least one level below reading scores. Spanish listening was approaching NH and French listening was at NH.

| In-app activity of Duolingo learners
The reading and listening proficiency scores demonstrated the extent of target language development that occurred in the beginning-level Duolingo Spanish and French courses; however, another aspect of efficacy is how efficient the learning process is. To understand the degree of efficiency of the Duolingo Spanish and French courses, the amount of time Duolingo learners took to reach the end of the beginning-level course content was calculated. The total number of hours that the study participants spent in all Duolingo sessions in the given course was computed and is summarized in Figure 3. This calculation is documented in Appendix C. The mean number of hours that learners across the two courses spent studying on Duolingo was 141 (median: 112). French learners spent on average about 20 h less than the Spanish learners to finish the beginning-level course, which is likely due to the smaller number of course skills in French, as reported in Table 2 above. Learners also varied considerably in the number of days elapsed between their first lesson on Duolingo for the target language and participation in the current study; on average, 562 days passed for Spanish learners (median = 412 days, SD = 551 days) and 634 days passed for French learners (median = 359 days, SD = 707 days). The number of days on which the learners used the app during these periods, however, varied immensely across the sample.
As expected, a high degree of variation exists in the amount of time learners spent learning on Duolingo (Spanish: SD = 118, IQR = [44-213]; French: SD = 115, IQR = [39-192]). Due, at least in part, to this variation, very small and nonsignificant correlations (Spearman's ρ) between time spent using Duolingo and test scores for either Duolingo course were observed (see Figure 4). Variation in time spent learning on Duolingo is expected due to low minimum requirements to progress through sections of the course. While each course skill has five difficulty levels, learners were required to complete only one of those levels to move on to the next row. Some learners reached the fifth difficulty level in all skills while others did the minimum to move along the course, thus leading to large between-participant differences in the number of hours spent learning on Duolingo. Furthermore, this time-spent-learning measure potentially spans many years of study; some participants may have completed fewer hours more recently, while others completed many hours spanning several years. These differences in participant behavior, and the resulting variation in the time-spent-learning measure, make it difficult to draw conclusions about the relationship between total time spent learning on Duolingo and learning outcomes measured by the ACTFL assessment. Future studies could address these issues and provide stronger signals about this relationship by using a pre- and posttest design with more control over the time spent learning over the course of the study.
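For readers unfamiliar with Spearman's ρ, the Python sketch below shows one standard way to compute it: convert both variables to ranks (averaging ranks for ties, which are common with a 10-point rating scale) and take the Pearson correlation of the ranks. This is an illustration, not the authors' analysis code, and the hours and scores are invented.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical hours of study and ACTFL scores (1-10 scale) for six learners.
hours = [40, 95, 120, 210, 60, 300]
scores = [4, 5, 4, 5, 3, 4]
print(round(spearman_rho(hours, scores), 3))
```

Because it depends only on rank order, ρ is less sensitive than Pearson's r to the long right tail in study time described above.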
On the days the learners chose to study (restricted to days between starting and completing Section 5, the final section before qualifying for study participation), they completed around eight lessons on average (Spanish: mean = 8.5, median = 6.9; French: mean = 7.9, median = 6.3). However, as with overall time spent learning, considerable variation was observed here (Spanish: SD = 7.0; French: SD = 5.1; see Figure 5 for full distribution). Learners also varied in the number of days taken to complete Section 5 (Spanish: mean = 81.2, median = 64.5, SD = 68.9; French: mean = 90.7, median = 75.5, SD = 62.6; see Figure 5 for full distribution). As noted in the Introduction, Duolingo courses are broken into "skills" that target certain vocabulary and/or grammatical concepts. Each skill includes five difficulty levels and learners are required to complete all skills in a given section at the lowest difficulty level (Level 0) before they can complete the Checkpoint for that section. Aside from this requirement for progressing to new sections, Duolingo learners are free to study however they want; some learners choose to focus on exploring new content (e.g., Level 0 lessons) while others study up on more familiar content (e.g., Level 1+ lessons). Due to this freedom, considerable variation was seen in the types of lessons that participants in this study completed (Figure 6).
FIGURE 2 Distribution of ACTFL proficiency ratings of Duolingo learners. The x-axis shows ACTFL rating acronyms (see Table 3).
The distributions in Figure 6 show that for many participants, the majority of lessons completed were Level 0; for more than 20% of participants (Spanish: 22.9%; French: 21.3%), Level 0 comprised at least half of all lessons completed. Other participants spent more time "leveling up" by studying with more difficult exercises. Bimodal distributions for Levels 2-4 sessions were observed, which means some users rarely "leveled up" (i.e., those focusing mainly on Level 0) and others spent time completing higher-level lessons.
Outside of standard lessons, many participants also completed Stories, which provide learners with discourse-level listening and reading practice. On average, 8%-10% of lessons completed by the participants were Stories (Spanish: 8.5%; French: 9.9%). Participants spent relatively little time completing "practice" lessons; on average, fewer than 4% of lessons completed were either practice type (Spanish: 3.8%; French: 3.7%).

| Comparison with university courses
The third research question of this study concerned how the proficiency outcomes of Duolingo learners compare with those of US-based university students in language courses, as documented in the foreign language proficiency test data (Winke et al., 2014–2017). Although the learner populations may be vastly different, with a much greater homogeneity in ages among the university-based learners, it is informative to establish correspondences between learner proficiency outcomes across distinct educational environments.
FIGURE 5 Distribution of Duolingo lessons completed per day (left) and number of days taken to complete Section 5 (final course section before ACTFL testing; right) [Color figure can be viewed at wileyonlinelibrary.com]
As mentioned earlier, the foreign language proficiency test data (Winke et al., 2014–2017) include assessments at various semesters of undergraduate study. The Spanish data include scores from second to eighth semester (except for the seventh semester) and the French data include scores from second to eighth semester (except for the fifth semester). On the basis of the performance of Duolingo learners reported above (see Table 5), a statistical comparison between Duolingo learners and fourth-semester university students was conducted. The fourth semester is the highest level in most university basic language programs before the traditional (if somewhat antiquated) "bridging" occurs into courses for language majors and minors (Graman, 1997) and is often used as the criterion for meeting degree- or university-based language requirements. The ACTFL ratings in the university data were coded numerically in the same way as the Duolingo data based on Table 3. Table 6 summarizes the descriptive statistics as well as the results of a series of t tests comparing the Duolingo and university learner performance.
To assess whether there were significant differences between Duolingo learners and university fourth-semester students, separate Welch two-sample t tests on each of the four sets of scores were carried out. No significant differences and small effect sizes (d; see Plonsky & Oswald, 2014) were found on Spanish listening (t = −1.74, p > .05, Cohen's d = −0.24), Spanish reading (t = 0.35, p > .05, Cohen's d = 0.04), and French listening (t = 1.41, p > .05, Cohen's d = 0.21), which suggests that Duolingo learners were not significantly different compared with university students at the end of their fourth semester. A significant and moderately sized difference for French reading was found (t = 4.36, p < .05, Cohen's d = 0.72), which showed that Duolingo learners performed significantly better than university students at the end of their fourth semester.
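The comparison described above can be sketched as follows: Welch's t statistic with Welch-Satterthwaite degrees of freedom, plus Cohen's d. The pooled-standard-deviation form of d is an assumption here, since the article does not state which variant was used, and the scores below are hypothetical numeric codes rather than study data. The first argument plays the role of the Duolingo group, so a positive t or d means the Duolingo group scored higher.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of the mean of group a
    vb = variance(b) / len(b)  # squared standard error of the mean of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical numerically coded proficiency ratings (not study data).
duolingo = [4, 5, 5, 6, 4, 5, 6]
university = [4, 4, 5, 5, 3, 6]
t_stat, dof = welch_t(duolingo, university)
effect = cohens_d(duolingo, university)
```

Obtaining a p value additionally requires the t distribution's cumulative distribution function, which the Python standard library does not provide; `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same Welch test and returns the p value directly.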
To show how Duolingo proficiency scores align with semester-based university data, second- to sixth-semester data from US university students were included in Figure 7. Please note that fifth-semester French data were not available in the university data set.
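Figure 7 displays group means with 95% confidence intervals. A minimal sketch of one way to compute such an interval is given below; it uses a normal approximation, which is an assumption (the article does not say how its intervals were derived), and the function name and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(scores, level=0.95):
    """Return (mean, lower, upper) for the mean using a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 when level is 0.95
    m = mean(scores)
    half_width = z * stdev(scores) / sqrt(len(scores))  # z times the standard error
    return m, m - half_width, m + half_width


# Hypothetical numerically coded proficiency ratings (not study data).
ratings = [4, 5, 5, 6, 4, 5, 6, 3, 5]
center, lower, upper = mean_ci(ratings)
```

For small groups, an interval based on the t distribution would be somewhat wider than this approximation.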

| Summary of findings
This study assessed the reading and listening proficiency of Duolingo learners who had completed the beginning-level material in the Spanish and French courses, analyzed their in-app activities, and compared their proficiency scores to those of fourth-semester university students on the same measures. The study aimed to answer three research questions. The first question asked about the levels of reading and listening proficiency that Duolingo learners achieved upon reaching the end of the beginning-level Spanish and French courses. Complementary to RQ1, our second research question was concerned with learners' in-app activity such as time spent studying, leveling up, and the specific Duolingo features they used en route to reaching the end of the beginning-level course. RQ3 inquired about how Duolingo learners' reading and listening proficiency scores compare with the proficiency outcomes of US-based university students in Spanish and French courses.
To answer the first research question, the results indicated that Duolingo learners who had completed the beginning-level material in Spanish reached IL in reading (according to ACTFL RPT) and approached NH in listening (according to ACTFL LPT), while learners studying French approached IM in reading and reached NH in listening. The current study was designed to address limitations in previous efficacy research for online language learning platforms, such as insufficient involvement from independent researchers and lack of rigor in the instruments used to measure proficiency. Due to study design and instrument differences for other research on the efficacy of online language learning platforms, comparison to these studies is difficult. However, the current results demonstrate the ability of online language learning platforms to teach to intermediate-level proficiency in reading and advanced novice-level in listening. Future studies will assess learner proficiency in other core competencies, such as speaking and writing, and investigate additional gains afforded by more advanced content.
Language learning apps are thought to be good for developing decontextualized linguistic knowledge and Duolingo is considered one example of such apps (Krashen, 2014). Although beginning-level Duolingo lessons focus on vocabulary and grammar at the sentence level, the findings demonstrated that learners were able to transfer discrete linguistic knowledge to integrative tasks such as reading and listening comprehension. This type of knowledge transfer and integration was also evidenced in Loewen et al. (2020), which shows that even with limited opportunities for oral production on Babbel, the explicit vocabulary and grammar knowledge that Babbel learners mastered led to encouraging gains in oral proficiency. The transfer of explicit linguistic knowledge is supported by Skill Acquisition Theory (e.g., DeKeyser, 2015), which states that practice and repetition can lead to proceduralization of explicit knowledge and hence improved language learning outcomes. With such findings in mind, Loewen et al. (2020, p. 19) proposed that the field of second-language acquisition should "recognize the pedagogical potential of widely used modern apps" and "abandon earlier characterizations of language learning apps as merely 'mechanical practice of selected and graded grammatical phenomena… in the form of drills'" by citing Heift and Vyatkina (2017). The authors of this study concur with Loewen et al. on this proposal. However, of course, any claims of skills transfer among Duolingo learners would need to be tested empirically.
FIGURE 7 Comparison of mean ACTFL proficiency test scores for Duolingo and the university study, with 95% confidence intervals. See Table 3 for the proficiency ratings shown on the y-axis [Color figure can be viewed at wileyonlinelibrary.com]
Indeed, the rate of development does not seem to be the same for all language skills and we would emphasize, again, that the present study only reports gains made in the two receptive skills of reading and listening. Although the participants' reading and listening scores were moderately correlated, the listening proficiency of Duolingo learners was significantly lower compared to reading proficiency, which replicated the findings of Tschirner (2016) and Rubio and Hacking (2019) for university students. Although both listening comprehension and reading comprehension are receptive skills, the comprehension processes have been found to be mostly modality-specific (Wolf et al., 2019). For learners at early stages of language learning, listening comprehension demands a higher level of attention, exerts a heavier load on working memory, and requires the ability for speedy decoding and processing of transient audio input (see, e.g., Bloomfield et al., 2010; Wallace, 2020). In contrast, learners' decoding process in reading is facilitated by the availability of visually presented text (Spoden et al., 2020; Vandergrift & Baker, 2015). As a result, listening comprehension is often more challenging than reading comprehension for second language learners. Some researchers also attributed students' lower listening proficiency to insufficient attention to auditory input and exercises in classroom instruction and called for more emphasis on listening development in instructional practices (Tschirner, 2016).
Analysis for RQ2 demonstrated that the median amount of time that the participants took to complete the beginning-level material was 112 h (99 h for French learners and 125 h for Spanish learners). On the days that the participants chose to study content in the beginning-level course section, they completed eight lessons on average, with the majority of the lessons at Level 0, which is the lowest and required level to progress through the course. Substantial variation in time spent studying may explain, at least in part, the lack of correlation between assessment outcomes and total time spent, a finding that contrasts with those of Loewen et al. (2020). The self-directed nature of the Duolingo learning platform contributes to this variation and complicates our interpretation of how the amount of learning effort contributes to assessment outcomes; for example, we observed bimodal "leveling up" behavior, where some learners chose to complete more difficult skill levels while others rarely did. It appears that the quality of the time spent, in terms of activities, lessons, and attention given, may matter just as much as the quantity of time spent using the app. Future studies could address these issues and provide stronger signals about this relationship by using a pre- and posttest design, which allows for more control over the time spent learning over the course of the study. Other studies have had success with this design (e.g., Loewen et al., 2020).
In comparing listening and reading proficiency between Duolingo learners and university students in language classes, the results indicated that the proficiency scores of Duolingo learners aligned with those of fourth-semester university students. Specifically, when Duolingo Spanish and French learners reached Checkpoint 5 at the end of the beginning-level course content on Duolingo, their Spanish reading, Spanish listening, and French listening proficiencies were comparable to what university students accomplished in four semesters of classes, while their French reading proficiency was significantly higher than that of fourth-semester university students. Previous studies have also compared proficiency following the use of online language learning products to university classroom outcomes. Lord (2015, 2016) found similar levels of achievement for classroom learners compared to learners using only Rosetta Stone over the course of a semester. Similarly, Rachels and Rockinson-Szapkiw (2018) observed no significant differences between outcomes for third and fourth graders using Duolingo to learn Spanish and those who received classroom instruction. The findings of the current study, combined with those from previous research, provide evidence that online language learning products can be effective methods for learning an additional language, at least in reading and listening.

| Limitations and directions for future research
The findings of the current study do not represent the overall effectiveness of Duolingo or university language courses, so they should not be overgeneralized. Participants of the study were only compared on reading and listening skills while teaching effectiveness can be reflected in other skills and abilities. In addition, there were a number of differences between the participants of the study and the university student sample. The university proficiency project tested full-time university students from a more homogeneous age range, while the participants of the current study were more varied demographically and included mostly post-university older adults. Similarly, the participants' motivations for language learning could also be more varied than those of university students, who included both those studying to meet a requirement and some who would later declare majors or minors in the language. These differences may put into question the comparability of the learners and the learning that took place in these two very different settings. The availability of the university proficiency data made this comparison possible; however, the comparison between Duolingo learners and university students should not be interpreted as competition between online language learning apps and university language programs. The aim in comparing learning outcomes from the two contexts is, rather, to benchmark the progress made by Duolingo learners relative to a more familiar and traditional setting.
The current study tested learners when they reached Checkpoint 5 independently. For future research, treatment studies with a pre- and posttest design will allow more control of learning time and participant factors that were self-reported in the present study, including prior proficiency, exposure to the target language outside of Duolingo, and the exclusion of other learning tools. This study focused on listening and reading proficiency, which are both receptive skills. Learners were not assessed in speaking (as in Rubio & Hacking, 2019) or writing. In subsequent studies, Duolingo's effectiveness in developing learners' productive skills will be evaluated as well. Doing so will provide a better understanding of whether and to what extent Duolingo learners' success in receptive skills generalizes to other skills.

| Pedagogical implications
The study indicates that using Duolingo as a tool to develop reading and listening proficiency may be at least as effective as developing these proficiencies in a university classroom through traditional pedagogies. Although Duolingo courses mostly teach vocabulary and grammar at the sentence level (with some longer-form content available in the form of short stories and podcasts), the results of this study also suggest that the seemingly discrete vocabulary and grammar knowledge can be applied to integrative tasks such as listening and reading comprehension.
The findings of the study indicate that learners who use Duolingo as a tool for self-directed study show substantial proficiency development. As we might expect, the usage data from the present study indicate a very slightly positive relationship between learners' total hours spent using the app and their reading and listening scores. In other words, more time on the app is associated with slightly higher scores, although, as reported above, these correlations were very small and nonsignificant. Duolingo app usage data also point to vast variability in the time (hours) and intensity of learning that participants took to complete the first five sections of their course. Consequently, it would be premature to make any suggestions regarding when and how the app might be used to maximize its efficiency. However, we plan to address this question in a future study.
In addition to self-directed learners, classroom teachers have used Duolingo to their advantage and benefited their students (Munday, 2016, 2017), suggesting that the app is also a useful tool to complement other types of language instruction. For instance, if vocabulary and grammar practice can be largely done by students as homework using apps such as Duolingo, more class time can be directed toward the teaching of culture and other communicative skills.

| Conclusion
This study assessed the reading and listening proficiency outcomes of Duolingo learners who had little to no prior knowledge of the target language and used Duolingo as the only learning tool. The findings demonstrated that learners who finished the beginning section of the Duolingo Spanish or French course reached IL in reading proficiency and NH in listening proficiency. These proficiency scores of Duolingo learners were comparable with the proficiency outcomes of students at the end of the fourth semester in university-based language programs (Rubio & Hacking, 2019; Tschirner, 2016). In conducting this study, we hope to have shed light on the potential effectiveness and comparability of Duolingo, as measured through standardized tests, to more traditional settings. Future studies will continue to build on our findings at other levels of study, in other linguistic domains, and in other target languages.

ENDNOTES
1 Duolingo offers all learners free access to the entirety of its instructional materials. Learners can optionally purchase a subscription, Duolingo Plus, but the subscription does not give access to any additional educational content. Instead, Duolingo Plus offers an ad-free experience, the ability to download lessons for offline use, and other gamification features.