PROTOCOL: Language of instruction in schools in low‐ and middle‐income countries: A systematic review

Abstract: To address the evidence gap in making effective language of instruction (LOI) decisions, we propose a systematic review of the role of LOI choices in education programs and policies on literacy outcomes in multilingual educational contexts in low‐ and middle‐income countries (LMICs). Grounded in a multidisciplinary theory of change (ToC) describing the factors that link LOI choices and literacy outcomes, we will gather, organize, and synthesize the evidence on the specific role of the three LOI choices described in the ToC (teaching in mother tongue [MT] with later transition, teaching in a non‐MT language, or teaching in two or more languages at one time) and their impact on literacy and biliteracy outcomes. We will focus our systematic review and meta‐analysis only on quantitative and qualitative intervention studies from LMICs, as these have the highest relevance for decision making in multilingual LMIC contexts. We will also only include languages that are relevant and commonly spoken in LMICs. For example, we will likely include studies that examine Arabic to English transfer, but not Arabic to Swedish transfer.

The benefits of multilingual education programs are multifaceted, including a higher likelihood of girls and marginalized communities staying in school (Benson & Hakuta, 2005), increased educational equity and maintenance of cultural and linguistic diversity (Ball, 2010), opportunities for parents and communities to participate in the learning process (Nag et al., 2019), as well as long-term cost benefits (Heugh, 2004, 2012). There are also clear cognitive benefits to learning to read in a known or familiar language, as the skills from the first language transfer and facilitate learning to read in a new language (Chung et al., 2019; Koda, 2008). Furthermore, strong bilingual education models have significant positive effects on non-linguistic functions (Bialystok, 2005) and executive function skills (Bialystok, 2018) that lay a strong foundation for later socioemotional skills, as well as on academic achievement (Collier & Thomas, 2017).
At the same time, there is an ever-increasing demand from communities for education in the national or international postcolonial language (Coleman, 2011). The primary reason for this demand is the link between the postcolonial language and socioeconomic mobility (Azam et al., 2010). Other factors that complicate LOI choices include linguistically heterogeneous classrooms, in which there are multiple MTs in one school or area (Nakamura et al., 2017; Reddy, 2011), as well as the fact that some MT languages have no scripts, lack teaching and learning materials, have few trained teachers, or lack the political or community will to be implemented as languages in education (Piper et al., 2016; Trudell & Piper, 2013).
This leads to a situation in which decision-makers must reconcile the well-documented benefits of MT instruction with the quest for socioeconomic mobility through a postcolonial or international (later acquired) language in the earlier grades. Therefore, this systematic review will focus on the effects of LOI choices in education programs and policies on student literacy outcomes in multilingual contexts in LMICs. In particular, we ask whether MT (or familiar language) instruction impacts reading outcomes, and we investigate the unanswered question of when to introduce or transition to additional languages of instruction to foster quality bilingual or multilingual reading outcomes.
1.2 | Theory of change for LOI policies and programs on literacy outcomes

| Theory underlying bilingual and multilingual literacy acquisition
To ground our theory of change in theory (Brown, 2020), we developed a learning science 1 framework of the cognitive mechanisms that underpin literacy learning in bilingual and multilingual learning contexts. This provides a theoretical basis for how we expect LOI policy and program interventions to impact literacy outcomes in LMICs.

The Cognitive Foundations of Reading and its Acquisition (CFRA) is a model that lays out the cognitive components required for successful reading in monolingual learners, and links those components to curriculum effectiveness and to reading teachers' knowledge and teaching effectiveness (Hoover & Tunmer, 2020). The Peter Effect in teaching reading is rooted in the principle that it is not possible to give what one does not have (Applegate & Applegate, 2004; Binks-Cantrell et al., 2012). Studies show that reading teachers' effectiveness is significantly related to their own reading enthusiasm (Applegate & Applegate, 2004) as well as their own knowledge of the cognitive foundations of reading (Binks-Cantrell et al., 2012).
Learning science theories from various disciplines such as psychology and linguistics reveal that the underlying mechanisms of reading skill acquisition in bilingual and multilingual learners differ from those in monolingual learners in significant and predictable ways. First, learning to read in a second or later acquired language (referred to henceforth as L2/x) is significantly impacted by transfer of reading skills from a first language (L1). 2
Second, L2/x learning is also significantly impacted by L2/x oral language skills, which are highly variable in L2/x learners compared to monolingual learners. This notion that L2/x reading skills rely on a combination of L1 reading skills and L2/x oral language skills is encapsulated in the linguistic interdependence hypothesis, the underlying proficiency hypothesis (Cummins, 1979, 1981), and the transfer facilitation model (TFM) of second language reading (Koda, 2005, 2007, 2008). Chung et al. (2019) provide an updated interactive framework for crosslinguistic transfer in L2/x reading, in which they posit that the relationship between L1 and L2/x reading skills is influenced by cognitive, linguistic, and metalinguistic factors such as language-specific versus language-neutral constructs, L1-L2/x distance, and L1-L2/x proficiency and complexity. They extend the model to postulate that transfer is also impacted by socio-cultural factors such as age of beginning acquisition of the L2/x, immigration experience, educational settings, and extent of exposure to the L1 and L2/x.
Indeed, empirical evidence is accumulating for each of these factors. In a meta-analysis of the cognitive and linguistic sub-skills in transfer, Melby-Lervåg and Lervåg (2011) find that phonological awareness and decoding skills show significant correlations across L1 and L2/x, but that these relationships are weaker (or absent) for oral language comprehension and reading comprehension subskills.
Reflecting this need to start with a foundation in the L1 for successful outcomes in the L2/x, Collier and Thomas (2017)

1 Learning science is the field of study that aims to 'better understand the cognitive and social processes that result in the most effective learning, and to use this knowledge to redesign classrooms and other learning environments so that people can learn more deeply and more effectively' (Sawyer, 2006, p. xi).

2 We refer to L1 as the first language(s) of the child or any language(s) the child uses and understands with high levels of familiarity and proficiency, oftentimes called 'mother tongue' (MT). We refer to L2/x as the second or later acquired language(s) of the child, that is, any language(s) they are learning and have emerging levels of familiarity and proficiency with.

(Sénéchal & LeFevre, 2002). Cross-country reviews also highlight that parental attitudes towards reading, the number of books at home (indirect home literacy environment [HLE] factors), and literacy-linked activities at home (direct HLE factors) have a significant impact on reading outcomes (Park, 2008). Given the vast mismatches between home and school language in LMICs (Nag et al., 2019), as well as the generally lower rates of adult literacy (and thus parental literacy) in LMICs (Abadzi, 2003), the evidence underscores the additional risk posed by home literacy and language environments that do not have the resources necessary to support reading development in the language of the school, or in any language (Nag et al., 2019).
However, studies also suggest that there is context-specificity in the relative importance of various dimensions of the home language and literacy environment for specific reading component outcomes (Friedlander, 2020; Nag et al., 2019; Park, 2008). Classroom and teacher factors such as attendance (of both teachers and students), a safe learning space (Spier et al., 2019), nutritional inputs (Plaut et al., 2017), and availability of print (through digital media or otherwise) are necessary factors for learning; however, they are not sufficient (Snilstveit et al., 2016).
Pedagogical inputs (such as structured learning progressions, skill-based learning, teaching and learning at the 'right' level) and teacher professional development are emerging as the most effective ingredients for translating access and safe learning spaces into quality learning outcomes (Evans & Acosta, 2020;Evans & Popova, 2015). In fact, Evans and Acosta (2020) underscore MT instruction programs as one of the most effective elements of pedagogical interventions.
Individual differences, such as age and socioeconomic status, are also known to moderate the relationship between teaching inputs and language and literacy outcomes. Although it is clear that language learning abilities decline as individuals get older (Flege et al., 1999), there does not seem to be conclusive evidence of a biologically based point after which a child's (or individual's) ability to learn a new language diminishes at a significantly higher rate than before (Bialystok, 1991; Bialystok & Miller, 1999; Birdsong & Molis, 2001; Hakuta et al., 2003). Neurobiological studies reveal that age of acquisition does not alter the underlying brain structure of bilinguals (Frenck-Mestre, 2016; Friederici et al., 2002).
However, more intuitively, differential aspects of language learning (such as phonological and grammatical processing) are more susceptible to age-associated declines than others, such as semantic processing (Abutalebi et al., 2001; Hernandez et al., 2000; Weber-Fox & Neville, 2001). Researchers in the West stress the critical difference between 'bilingual education' (positive, additive connotation) and 'education of bilingual children' (negative, subtractive connotation) (Bialystok, 2018). This distinction manifests itself in policies that either embrace bilingualism and multilingualism as a force for global integration or use local languages as stepping stones towards a different national or international language, which in turn shapes motivation to learn and community and family involvement in the education system.
Taken together, these studies help us move towards the development of a middle-range theory on multilingual education and biliteracy acquisition. However, it is still unclear how to construct an effective LOI policy, beyond noting that teaching a child in a language they are familiar with is critical for learning. There is little understanding of the mechanism of transfer of skills from one language to another, the 'right' timing or skill level at which a child is most likely to benefit from learning a new language, or how to foster quality bilingual/multilingual outcomes after the initial year(s) in which MT interventions that aim to improve literacy skills may be effective.
In Figure 1, we present our logic framework. We begin with the key assumption that the child has access to a learning program.
Access can occur in the form of school infrastructure with teachers who do not use any technology; a blended learning environment within a school or community building, where teachers or teaching assistants may use some technology to enhance learning (e.g., the eSchool 360 model implemented by the Impact Network in Zambia); an online/digital learning program that is, or could be, facilitated by a remote teacher or guide (e.g., Mindspark software); or an entirely online/digital learning program that is self-guided or guided by a virtual built-in guide (e.g., the Google Bolo app). Another assumption underlying our logic framework is that teachers are willing and able to learn and change how they teach in line with new curricula, teaching materials, and pedagogies tailored to bilingual or multilingual students and varying language types.
The introduction of revised LOI choices cannot be effective or adequately applied within the classroom without a revision in the teaching and learning materials to reflect this change. Further, since not all teachers may be fully fluent in the LOI choices or know how to teach those language(s), some may be required to obtain additional trainings to improve teaching knowledge or be re-assigned to schools where they can teach language(s) they are fluent in and trained to teach reading in.
LOI transition intervention activities can be manifested in many ways. For the purposes of this research, we operationalize LOI transition programs and policies as those that have any one or any combination of the following components: (1) An education program or policy that is implemented in a MT or local language, which will then lead to a transition to (complete change) or addition of (adding as a subject or dual language instruction) teaching of a new language, which may or may not occur during the course of the program. These are important to examine as the skills taught and learned during the course of the program will have significant implications for 'readiness' of transfer to the new later acquired language.
(2) An education program or policy that is implemented in a language that is not the child's 'own' language (i.e., a language the child has enough proficiency to learn in). These programs are important to investigate as they constitute a child 'transitioning' out of their own language into an education system in a new language right from the start of education.
F I G U R E 1 Logic framework for language of instruction (LOI) policies and interventions on literacy outcomes.
(3) An education program or policy that is implemented in which students transition from one LOI to another or add a new language as part of the medium of instruction during the course of the program or policy.
In any of these scenarios, students' learning acquisition process can be impacted by the language used for teaching, and as such can have significant implications for the effectiveness of learning to read. This is regardless of whether the LOI transition is the key component of the educational program. 3 Furthermore, the program or policy intervention is most likely to succeed in improving learning outcomes if it has standards, curriculum, and trained teachers (the latter for classroom-based instruction, as opposed to technology-based instruction) that focus on the cognitive foundation models (CFRA) (Hoover & Tunmer, 2020) and/or the interactive models of reading transfer (Chung et al., 2019).
Although several programs may not explicitly have these theoretical frameworks named in their models, this is based on the theoretical premise that a curriculum or a teacher cannot give what they do not have (Applegate & Applegate, 2004;Binks-Cantrell et al., 2012).
These programmatic components (or activities) may improve the quality of teacher (or technology) knowledge and practices, increase the child's motivation to learn (as this will maintain teaching at the 'right' level; Banerjee et al., 2016; Pritchett & Beatty, 2015), and increase parental and community involvement in the child's education (Benson & Hakuta, 2005). Finally, all of these will improve the effectiveness of the LOI decision, leading to impacts on the child's L1 literacy skills (reading or decoding based skills as well as oral language skills) and the child's L2/x literacy skills.
We also examine the role of several possible factors that are likely to moderate the likelihood that the intervention will improve literacy skills, including community demand for the L1 versus the L2/ x, local and national policies supporting the implementation of the program or policy, socioeconomic status, parental literacy/schooling level, language(s) spoken at home, home literacy environment (exposure to print), child's initial language use and proficiency level in language(s) of the school, gender, and disability.
Given that LOI choices touch several aspects of the education system, we aim to synthesize the linkages between the inputs to trace how components of LOI programming may impact different sections of the system. For instance, inputs in curricular choices in terms of timing and sequencing of skills in each language would impact standards and curriculum development decisions; whereas teacher training inputs would impact professional development modules, assignment of teachers to schools based on languages they speak versus languages students speak, and urban-rural teacher availability. All practice and policy recommendations will be interpreted within the theory of change, to further develop a middle-range theory for LOI decision making in LMICs that is reflective of both the micro psycholinguistic and learning science ingredients in improving learning outcomes as well as the macro sociolinguistic, socioeconomic, and political environment within which LOI policy and practice decisions are being made.

2 | WHY IT IS IMPORTANT TO DO THE REVIEW
This systematic review aims to help decision makers (ministries of education, teacher training institutes, community leaders, and other interested stakeholders) make effective LOI choices. We will also examine evidence gaps that may hinder efforts to implement successful LOI policies in multilingual LMICs. We will focus particularly on the Ministry of Education in Ethiopia, to guide the implementation of the country's new three-language policy that is part of the education roadmap reform being developed for rollout in the near future. Our conversations with key stakeholders in the Ethiopian education system, including the Ministry of Education team tasked with developing the roadmap, suggest there is an urgent need to gather and understand the evidence on how to implement this multilingual LOI policy effectively.
The generalizability of the findings across language types is informed by the framework that all writing systems of the world fall into four main types (Nag & Perfetti, 2014; Perfetti, 2003): alphabetic, syllabic, alphasyllabic, and morphosyllabic. Although we will examine various local contextual factors that may hinder or facilitate the impact of a program, including linguistic complexities on the dimensions of orthographic depth, orthographic breadth, and graphic complexity in both/all languages, we will be able to consolidate the findings into a broader middle-range theory for each of the four writing system types.
This study builds on recent systematic reviews that have shown that MT instruction is critical for learning quality (Evans & Acosta, 2020; Nag et al., 2019), but is unique in that it will be the first to systematically review the evidence on how and when to add, or transition from, one LOI to another. Furthermore, the study utilizes a combination of methods, including critical discourse analyses to map policy and practice documents to the evidence generated from the systematic review. Finally, by exploring various psycholinguistic underpinnings of reading, examining a variety of sociolinguistic contexts of learning, and drawing from our multi-disciplinary theoretical framework, it helps build middle-range theory on the mechanisms that may explain why certain LOI policies are likely to be more effective than others.

3 Where possible, we will isolate the impact of LOI from impacts of other programming features such as learning materials, teacher professional development, and so forth. However, the likelihood of finding studies within our search for which we are able to do so is low, as LOI policies are rarely, if ever, randomly assigned or assigned in isolation.

NAKAMURA ET AL.

3 | OBJECTIVES
To address the evidence gap in making effective LOI decisions, we propose a systematic review 4 of the role of LOI choices in education programs and policies on literacy outcomes in multilingual educational contexts in LMICs. Grounded in the multidisciplinary theory of change described above of what factors link LOI choices and literacy outcomes, we will gather, organize, and synthesize the evidence on the specific role of the three LOI choices described in the ToC (teaching in MT with later transition, teaching in a non-MT language, or teaching in two or more languages at one time) and their impact on literacy and biliteracy outcomes. We will focus our systematic review and meta-analysis only on quantitative and qualitative intervention studies from LMICs, as these have the highest relevance for decision making in multilingual LMIC contexts. We will also only include languages that are relevant and commonly spoken in LMICs. For example, we will likely include studies that examine Arabic to English transfer, but not Arabic to Swedish transfer.

4 | METHODOLOGY
In this section we provide detail on the methods that will be employed to answer our primary research question.
4.1 | Criteria for including and excluding studies

| Types of study designs
The primary research question on the effectiveness of interventions will be addressed using quantitative experimental or quasi-experimental studies as well as qualitative studies that include a programmatic or policy intervention.
Specifically, we will include the following study designs for quantitative studies: (1) experimental designs using random assignment to the intervention and (2) quasi-experimental designs using non-random assignment. We will include studies with data collected at the individual level to ensure that the study focuses on child-level learning outcomes.
We will include each of the multivariate quasi-experimental methods to maximize the external validity of the systematic review.
However, several of the quasi-experimental studies we propose to include may include only OLS regression analysis and, therefore, may not be able to provide unbiased impact estimates. In such cases, we will exclude these studies from our meta-analysis. To mitigate concerns about the internal validity of some of the included studies, we will conduct a risk of bias assessment and stratify our meta-analysis by identification strategy, where feasible, as in Brody et al. (2015). This stratified meta-analysis will enable us to assess the internal validity of the included studies with a high risk of bias by comparing their impact estimates with those in studies with a low risk of bias (Chinen et al., 2017).
For qualitative studies, we will include any intervention studies that utilize the following illustrative methods: (1) case studies; (2) focus group discussions; (3) key informant interviews; and (4) observations of classrooms or community language use. At the abstract screening stage, we intend to include all qualitative studies that examine an intervention, regardless of methodology. Depending on the number of studies that are returned from the abstract screening stage, we will either select only those qualitative studies that are linked to a quantitatively examined intervention, to explain why or how that particular intervention may or may not be effective; or select all qualitative intervention studies for full-text review, which will provide a fuller picture of how and why multilingual or MT instruction programs may or may not be effective in LMICs in general.

4 We follow the definition of systematic reviews and evidence syntheses from the Campbell Collaboration: 'a systematic review is an academic research paper… that uses a method called evidence synthesis to look for answers to a pre-defined question'.

| Types of participants and settings for both quantitative and qualitative studies
We will include studies that focus on interventions that include primary and secondary school-aged children in LMICs, as defined by the World Bank. 5 We will include studies about the effects of LOI choices regardless of the educational status or skill level of children at the time of the intervention. Only studies conducted between 1995 and 2020 and published in English or Amharic will be considered.
In the case of qualitative studies, we will include intervention studies that have a focus on school-aged children in LMICs, as defined by the World Bank.

| Types of interventions
The interventions included in this review will be LOI choices made by educational policies and programs that directly aim to increase children's literacy in bilingual or multilingual LMIC education contexts. These interventions include programs with one or more of the following components:

• Full early learning programs for MT education or for bilingual and multilingual children

Where a 'no intervention' or 'business as usual' comparison is not selected, eligible comparison conditions will include students before a LOI policy change within a region, students within regions with different or no LOI transition policies within a country, and students in schools that are not in a MT or regional language program. In the case of qualitative studies, a comparison will not be necessary.

2. Student motivation: Defined as a child's motivation to attend school as well as to want to learn to read or engage with language, print, or stories. This will be measured in students' reading behaviours at home, in school, and in the community, and in attitudes towards reading.

3. Parental and community involvement: Studies have found that there is a significant association between students being taught in home and community languages and parental involvement in the education system (Benson, 2007). We define this outcome as the frequency and quality of interactions between parents and/or community members and teachers, as well as the amount of time spent by parents involved with students' learning at home (helping with homework, learning from the student, supporting the student with their learning, asking questions about school, etc.).

5 We are using the 2014 LMICs definition, which includes Argentina, Hungary, Seychelles, and Venezuela. These countries, however, were categorized as high-income countries in the July 2015 update.

6 Bilingual or multilingual programs are defined as any that have more than one language of instruction, have additional language subject classes, have students who are bilingual or multilingual, or have students in mixed-language/linguistically heterogeneous classrooms.

Final Outcomes:
We provide operational definitions for our final outcomes from a range of research on reading development that looks at reading across language and orthographic types as well as across L1/L2 learning status (Koda & Zehler, 2008; USAID et al., 2019; Verhoeven & Perfetti, 2017). Each of these skills will be considered for both L1 and L2/x:

4. Sound-symbol correspondences: Oftentimes called letter naming or letter knowledge, sound-symbol correspondence skills refer to the ability to see a single printed letter, akshara, or character and sound out the symbol.

5. Decoding: The ability to see a printed word or cluster of symbols and sound out the word or cluster of symbols. There are several paths that learners take to acquire this skill, but our primary concern will be whether or not students are able to reach the entire phonological representation of the printed word, regardless of which path they take.
6. Oral reading fluency: The ability to sound out a short passage or story with accuracy, speed, and prosody.
7. Reading comprehension: The ability to comprehend both explicit and implicit information presented in single or multiple phrases or sentences of text.
If data are not available for each of these subskills separately, based on the CFRA (Hoover & Tunmer, 2020), we will create composite scores for the emergent literacy and oral language measures (#1-3 above), for the decoding scores (#4-6 above), and for the reading comprehension scores (#7). Understanding that many EGRA instruments include only four reading comprehension questions, we will consider either removing these items or merging them with the decoding scores for reliability, if necessary.
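To make the composite construction concrete, here is a minimal sketch of one way subskill scores could be standardized and averaged into a cluster composite. The data, variable names, and equal-weight averaging are illustrative assumptions, not a prescribed scoring procedure:

```python
import statistics

def zscores(xs):
    """Standardize a list of raw scores to z-scores."""
    mu = statistics.mean(xs)
    sd = statistics.stdev(xs)
    return [(x - mu) / sd for x in xs]

def composite(*subtests):
    """Average z-scores across subtests for each child.

    Each argument is a list of raw scores (one entry per child) on one
    subtest; the composite is the mean of each child's z-scores.
    """
    standardized = [zscores(s) for s in subtests]
    return [statistics.mean(child) for child in zip(*standardized)]

# Hypothetical decoding-cluster data: sound-symbol knowledge, word
# decoding, and oral reading fluency for five children.
letters = [10, 25, 40, 55, 70]
words = [2, 8, 15, 22, 30]
orf = [5, 20, 35, 50, 65]
decoding_composite = composite(letters, words, orf)
```

Standardizing before averaging keeps subtests with larger raw-score ranges (such as fluency counts) from dominating the composite.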
We will include only literacy outcomes even if the study looks at scores on other subjects.
Outcome measures will not be considered to filter qualitative studies, which will serve to address the secondary research question.

| Search strategy
We developed a search strategy in consultation with an information specialist. Our search strategy will enable us to identify relevant published and unpublished literature by focusing on relevant academic and institutional databases, citation tracking, and snowballing of references. We identified the following sources for our literature searches.

| Electronic sources
Comprehensive database searches will include the following paid-access and free-access electronic databases: 18. Additional key papers identified from institutional websites.

| Screening phase 1
After our initial search is completed, we will conduct a manual abstract review process. Each abstract will be reviewed independently by two trained reviewers.

| Screening phase 2

In the second phase, we will review the full text of all studies that pass Phase 1 screening. Multiple reviewers will independently identify and confirm the following information for each study:

| Quantitative studies
• Two team members with expertise in quantitative research will work independently to extract information from each quantitative study included in the review. Both team members will use a data extraction form and enter the extracted data into a table. We will resolve disagreements through discussion.

| Risk of bias assessment
We will determine the rigour of the quantitative studies using an adaptation of a set of criteria to assess risk of bias in experimental and quasi-experimental studies (Hombrados & Waddington, 2012). While the risk of bias assessment is very labour intensive, the number of quantitative studies we expect to require this assessment is low. We will assess the risk of the following biases: 1. Selection bias and confounding, based on the quality of the identification strategy used to determine causal effects and on the assessment of equivalence between beneficiaries and nonbeneficiaries.

| Measures of treatment effects
In accordance with Chinen et al. (2017), we will extract information from each quantitative study to estimate the standardized effect sizes (for continuous variables) or odds ratios (ORs) (for binary variables) across studies. We will also calculate standard errors and confidence intervals where feasible.
We will report effect sizes as Hedges' g and will adjust effect sizes reported as Cohen's d to Hedges' g. We will use Hedges' g effect sizes (sample-size-corrected standardized mean differences [SMDs]) for continuous outcome variables, which measure the effect size in units of standard deviation of the outcome variable; for binary outcomes, we will calculate ORs.
The SMD using Cohen's d is calculated by dividing the mean difference by the pooled standard deviation, as in Equation (1):

d = (Y_t − Y_c) / S_p (1)

where Y_t refers to the outcome for the treatment group, Y_c refers to the outcome for the comparison group, and S_p refers to the pooled standard deviation.
The pooled standard deviation S_p will be calculated by applying either of the following equations:

S_p = SD_y × √[n_t × n_c / (n_t + n_c)] (2)

S_p = √{[(n_t − 1) × SD_t² + (n_c − 1) × SD_c²] / (n_t + n_c − 2)} (3)

where SD_y refers to the standard deviation for the point estimate from the regression, SD_t and SD_c refer to the standard deviations for the treatment and comparison groups, n_t refers to the sample size for the treatment group, n_c refers to the sample size for the comparison group, and β refers to the point estimate (which replaces the mean difference in Equation (1) for regression-based studies). We will use Equation (2) for regression studies with a continuous dependent variable and Equation (3) when the study provides information about the standard deviation for the treatment group and the comparison group.
To transform Cohen's d into Hedges' g, we will apply the small-sample correction for the SMD:

$$g = d\left(1 - \frac{3}{4(n_t + n_c) - 9}\right) \quad (4)$$

Lastly, we will calculate the standard error of g by applying Equation (5):

$$SE_g = \sqrt{\frac{n_t + n_c}{n_t n_c} + \frac{g^2}{2(n_t + n_c)}} \quad (5)$$

For studies using linear probability models, we will assume linearity in the estimation of standardized effects, as in Brody et al. (2015). For example, if we observe a mean baseline value for the comparison group of 0.097 and an effect size of 5.1 percentage points, then we will assume that the follow-up value for the treatment group is 0.097 + 0.051 = 0.148 and that the follow-up value for the comparison group remains 0.097.
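To illustrate the computation of the standardized mean difference and its small-sample correction, here is a minimal Python sketch under the assumption that group means, standard deviations, and sample sizes are reported; the function names and example values are ours:

```python
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    s_p = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
    return (mean_t - mean_c) / s_p

def hedges_g(d, n_t, n_c):
    """Hedges' g: Cohen's d with the small-sample correction applied."""
    return d * (1 - 3 / (4 * (n_t + n_c) - 9))

d = cohens_d(52.0, 48.0, 10.0, 10.0, 30, 30)  # 0.4
g = hedges_g(d, 30, 30)                        # slightly shrunk toward zero
```

With equal group standard deviations of 10 and a mean difference of 4, d is 0.4, and the correction shrinks it slightly for the modest sample size.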
To correct the standard errors for studies where the outcome variable is clustered at a level above the individual or household, we will adjust the standard errors and confidence intervals using the variance inflation factor (Higgins & Green, 2011):

$$SE_{adjusted} = SE\sqrt{1 + (m - 1)\,ICC} \quad (6)$$

where $m$ is the number of observations per cluster and $ICC$ is the intracluster correlation coefficient. We will estimate the ICC for each relevant outcome measure across the included quantitative studies for which we are able to access the data on the outcome measures.
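This clustering correction can be sketched in a few lines of Python; the function name and example values are ours:

```python
import math

def cluster_adjusted_se(se, m, icc):
    """Inflate a naive standard error by the square root of the design
    effect, 1 + (m - 1) * ICC, to account for clustering."""
    return se * math.sqrt(1 + (m - 1) * icc)

# e.g., 25 pupils per classroom and an ICC of 0.2 inflate the SE
# by a factor of sqrt(1 + 24 * 0.2) = sqrt(5.8), roughly 2.4
adjusted = cluster_adjusted_se(0.10, 25, 0.2)
```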
When we are unable to retrieve the missing data, we will impute effect sizes and associated standard errors based on the t or F statistic or the p value. We will use David Wilson's practical meta-analysis effect-size calculator to conduct such imputations. Where sample sizes for the treatment and the comparison group are not reported, we will assume equal sample sizes across the groups.
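For the common case of a reported independent-samples t statistic, the standard conversions used by such calculators can be sketched as follows; the function names are ours:

```python
import math

def d_from_t(t, n_t, n_c):
    """Impute Cohen's d from an independent-samples t statistic."""
    return t * math.sqrt((n_t + n_c) / (n_t * n_c))

def se_of_d(d, n_t, n_c):
    """Approximate standard error of the standardized mean difference."""
    return math.sqrt((n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c)))

d = d_from_t(2.0, 50, 50)   # 0.4
se = se_of_d(d, 50, 50)
```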

| Methods for handling dependent effect sizes
Where studies report more than one effect size on the basis of different statistical methods, we will follow the procedure laid out in Chinen et al. (2017) and select the effect size with the lowest risk of bias. Where studies report more than one effect size based on the same individuals, we will employ robust variance estimation techniques to adjust for effect size dependency (Hedges et al., 2010).
When studies present multiple impact estimates for different variables measuring the same construct, we will use a sample-size weighted average to compute a 'synthetic effect size'. In cases where more than one study uses the same data set (e.g., national-level EGRA data) to measure a literacy outcome, we will use the effect size from the study with the lowest risk of bias. If the risk of bias is the same, we will estimate an average effect size through inverse-variance weighted random-effects meta-analysis. In cases where one study measures the same outcome at different points in time, we will extract the effect size from the outcome measure taken closest to the measurement time points of the other studies included in the same meta-analysis. In cases where studies include more than one treatment arm, we will include the effect size from the treatment arm with the lowest risk of bias. If the risk of bias is the same, we will use the effect size from the treatment arm that is most similar to the other programs included in the meta-analysis.
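The sample-size weighted 'synthetic effect size' amounts to a simple weighted mean; a sketch with hypothetical numbers (the function name is ours):

```python
def synthetic_effect_size(effects, sample_sizes):
    """Sample-size weighted average of effect sizes that measure
    the same underlying construct within one study."""
    total = sum(sample_sizes)
    return sum(e * n for e, n in zip(effects, sample_sizes)) / total

# Two hypothetical decoding measures from one study, n = 120 and n = 80:
# (0.30 * 120 + 0.50 * 80) / 200 = 0.38
combined = synthetic_effect_size([0.30, 0.50], [120, 80])
```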

| Meta-analysis
We will pool the results of the quantitative studies that focus on the same outcome variables and the same intervention types using meta-analysis. In other words, we will conduct multiple meta-analyses based on intervention type and outcome variable (we describe above which reading outcomes will be pooled if necessary). We will examine the heterogeneity of the effect sizes for each outcome across studies, use meta-regression to model the variation in effect sizes, and use forest plot visualization (Borenstein et al., 2009). We will use Stata to conduct the meta-analysis.
For the meta-analysis, we will include only studies with an emphasis on LOI choice that use one of the following designs: (1) experimental designs using random assignment to the intervention and (2) quasi-experimental designs with nonrandom assignment (such as regression discontinuity designs, 'natural experiments', and studies in which participants self-select into the program).
Where possible, we will perform sensitivity analysis for potential moderators:
• Risk of bias status for each risk of bias category;
• Study design (randomized controlled trials vs. quasi-experimental studies);
• Gender;
• SES;
• Parental literacy levels;
• Alignment of language spoken at home with LOI;
• Geography.

We will use random-effects meta-analysis because the average effect of LOI choice is likely to differ across contexts due to differences in program design and contextual characteristics. We will supplement our random-effects meta-analysis with network meta-analysis to enable indirect comparisons of two treatments that have a common comparator.
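Although the analysis will be run in Stata, the random-effects pooling step can be illustrated with a standalone sketch of the DerSimonian-Laird estimator, one common choice for random-effects meta-analysis (our choice for illustration; the function name is ours):

```python
def dersimonian_laird(effects, std_errors):
    """Random-effects pooled effect via the DerSimonian-Laird tau^2
    estimator. Returns (pooled effect, SE of pooled effect, tau^2)."""
    w = [1 / se**2 for se in std_errors]                        # fixed-effect weights
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - fixed)**2 for wi, e in zip(w, effects))   # Cochran's Q
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)               # between-study variance
    w_re = [1 / (se**2 + tau2) for se in std_errors]            # random-effects weights
    pooled = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
    return pooled, (1 / sum(w_re)) ** 0.5, tau2
```

When all studies report identical effects, Cochran's Q is zero, tau^2 collapses to zero, and the estimator reduces to the fixed-effect inverse-variance average.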
We will also use stratified meta-analysis according to contextual and methodological moderator variables to investigate factors explaining heterogeneity. We will use two contextual moderating variables: (1) type of orthography and (2) grade.

| Missing data
In cases where it is not feasible to estimate the effect size because of missing data, we will contact the researchers to request the missing information to calculate the effect sizes. If authors do not respond or do not provide sufficient information to calculate the effect size, we will not include the study in the meta-analysis. Even so, the study and its findings will still be discussed within our narrative write-up.

| Treatment of qualitative research
Every study that is selected for full-text review will undergo a full quality appraisal.

| Quality appraisal
We will assess the quality of the qualitative studies using the nine-item Critical Appraisal Skills Programme Qualitative Research Checklist (Critical Appraisal Skills Programme, 2013), judging the adequacy of the stated aims, the data collection methods, the analysis, the ethical considerations, and the conclusions drawn. Two trained researchers on our team will independently complete the appraisal, determining whether the study has adequately met each item and giving a 'yes', 'no', or 'can't tell' response. Afterwards, they will come together to discuss their responses to each item until they reach consensus. We will rate studies that score 0-2 'no' or 'can't tell' responses as low risk of bias, studies that score 3-5 'no' or 'can't tell' responses as medium risk of bias, and studies that score 6-9 'no' or 'can't tell' responses as high risk of bias (Table 2).
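The scoring rule above can be expressed as a small helper function (the function name is ours):

```python
def casp_risk_rating(responses):
    """Map the nine CASP item responses ('yes', 'no', or "can't tell")
    to a risk-of-bias rating based on the count of non-'yes' responses."""
    if len(responses) != 9:
        raise ValueError("expected responses to all nine CASP items")
    flagged = sum(1 for r in responses if r != 'yes')
    if flagged <= 2:
        return 'low'
    if flagged <= 5:
        return 'medium'
    return 'high'
```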
After full-text review, we will conduct a thematic synthesis of the qualitative study findings. Each study's main findings will be coded to encapsulate the content of each finding (e.g., 'the teacher joins Portuguese and the local language to help the student understand', 'we use [the local language] only to pull the student from where he is and understand the subject'). These statements will then be categorized into higher-order themes (such as 'use of local language in postcolonial language classes'). We will then extract implications for better understanding why or how multilingual education choices work in various contexts.

| Critical discourse analysis
To answer our secondary research question, focused on the overall policy messages conveyed by key donor and stakeholder agencies around LOI choices, we will conduct a qualitative critical discourse analysis (CDA), through Systemic Functional Linguistics (SFL) analysis (including linguistic and textual analysis) and ideological analysis (Martin & Rose, 2007; van Dijk, 2006), on 2-3 key donor and MOE documents on LOI policies or strategies.
Using CDA, we will analyse the discourse in these LOI documents, including the LOI-related discourse that these donor institutions have included and excluded, to understand the dynamics between the development assistance network, donor institutions, and, if feasible, nation-states. In addition, we will conduct CDA based on discursive psychology to understand the positions and power relations between these donor institutions and nation-states. The discursive approach helps explore the identity of stakeholders, their positions, and their narration within a social context (Hajer, 1995; Hewitt, 2009). Using the discursive tradition will help reveal how donors justify certain models of MT-based education and how donors persuade nation-states on LOI policies.
Together, the SFL and ideological analysis approaches will help us identify how stakeholders present and/or consume a 'shared set of ideas' about language and education: for example, how evidence is discussed and applied, how groups conducting MT education versus post-colonial medium-of-instruction education are discussed and described, and so forth.
SFL provides a framework through which we will analyse the linguistic features used: How often do they use particular words? What affect do those words carry? Do the linguistic features carry explicit or implicit power structures in terms of LOI choices and consequences? What type of evidence is used to justify certain models of LOI? How is the concept of 'mother tongue' described by the donor institution versus the MOE, and on what narrative ideology is this concept based? Given that labelling is one of the very first steps in realizing how language is treated in any language policy or planning decision (Kaplan & Baldauf, 1997), and that terminological variance abounds in LOI policies worldwide, this will also shed considerable light on how donors, other decision makers, and lay audiences consume and use information and evidence on language decisions in education.
These approaches will allow us to select, analyse, and interpret the essential messages embedded in donor policies and discourse around language issues in education policies. The primary theoretical premise of CDA is that language choices shape, and are shaped by, society, and that language policy is influenced largely by the historical, social, political, and ideological environments of nation-states (Fairclough, 2009). As such, this approach will help various stakeholders (MOEs, teacher training institutes, implementing organizations) understand and objectively evaluate donors' monetary prioritizations, and will help donors review and, if necessary, revise or adapt their existing and future education policies to incorporate a more evidence-based view of language issues.
Once the CDA has been conducted, we will disseminate the findings through webinars, blog posts, social media, and individual communications to key stakeholders, including MOE personnel and key members of donor organizations and implementing partner organizations. The results will be tailored to each stakeholder through policy briefs, brochures, and easy-to-access materials. The dissemination will be closely followed by co-interpretation and planning meetings or workshops. In these meetings, stakeholders will be encouraged to closely analyse existing documents and discourse around the role of language in education (including assessing whether such discourse is absent altogether) and then work together to determine changes that could steer the course of LOI policy in a direction more in line with the evidence, as well as highlight gaps that still hinder decision makers' abilities to move forward with more effective LOI policies and practices.

| QUALITY ASSURANCE (QA)
AIR has a rigorous system for QA, which ensures that we deliver high-quality products. Dr Thomas de Hoop serves as our QA reviewer on this project; he has over 13 years of experience designing and managing large-scale systematic reviews in education in LMICs. He will be responsible for providing support to the team and for ensuring the quality of the research materials produced.
Specifically, Dr. de Hoop will support the project team by providing inputs on the design of the evidence synthesis protocol, refining work and analysis plans, reviewing analyses and results, making suggestions about the interpretation of further analyses, and reviewing final research products, including blogs, policy briefs, and the final synthesis report.
Dr. de Hoop will sign off on all drafts and final drafts after QA of each deliverable.
6.1 | COVID-19 risk mitigation

AIR has extensive experience facing and overcoming challenges associated with managing and conducting research. While the desk-based nature of the evidence synthesis greatly reduces the risks to the project, especially considering the current COVID-19 pandemic, we are aware that a few potential risks remain. For instance, one or more of our project team members may be directly or indirectly affected by COVID-19, reducing their ability to work on the project.
However, we developed the project team such that every position has backup support from another staff member, so the project is unlikely to be delayed by COVID-19 affecting any one team member. In the unfortunate circumstance that multiple team members are affected by COVID-19 at the same time, we will pull from our internal staffing networks to bring on additional staff to support the research activities. Again, we do not foresee this resulting in any delays to the progress of the project.

ACKNOWLEDGEMENTS
Our sincere appreciation goes to the Centre of Excellence for Development Impact and Learning (CEDIL) for the financial and technical support provided. In addition, we would like to thank Dr.
Thomas de Hoop for his guidance and technical oversight of this review.