Multilingual assessment of early child development: Analyses from repeated observations of children in Kenya

Abstract In many low‐ and middle‐income countries, young children learn a mother tongue or indigenous language at home before entering the formal education system where they will need to understand and speak a country's official language(s). Thus, assessments of children before school age, conducted in a nation's official language, may not fully reflect a child's development, underscoring the importance of test translation and adaptation. To examine differences in vocabulary development by language of assessment, we adapted and validated instruments to measure developmental outcomes, including expressive and receptive vocabulary. We assessed 505 2‐to‐6‐year‐old children in rural communities in Western Kenya with comparable vocabulary tests in three languages: Luo (the local language or mother tongue), Swahili, and English (official languages) at two time points, 5–6 weeks apart, between September 2015 and October 2016. Younger children responded to the expressive vocabulary measure exclusively in Luo (44%–59% of 2‐to‐4‐year‐olds) much more frequently than did older children (20%–21% of 5‐to‐6‐year‐olds). Baseline receptive vocabulary scores in Luo (β = 0.26, SE = 0.05, p < 0.001) and Swahili (β = 0.10, SE = 0.05, p = 0.032) were strongly associated with receptive vocabulary in English at follow‐up, even after controlling for English vocabulary at baseline. Parental Luo literacy at baseline (β = 0.11, SE = 0.05, p = 0.045) was associated with child English vocabulary at follow‐up, while parental English literacy at baseline was not. Our findings suggest that multilingual testing is essential to understanding the developmental environment and cognitive growth of multilingual children.

potentially complicated to implement in linguistically diverse environments, it may allow children to learn more and may better permit their parents to engage with teaching materials and monitor student performance (Benson, 2002; Kosonen, 2005; Lieberman, Posner, & Tsai, 2014). In the case of Kenya, 45% of mothers of school-aged children cannot read English at a second-grade reading level (Uwezo, 2015); in one study, 72% of parents reported not understanding how to interpret student-learning data (Lieberman et al., 2014). Thus, a country's policy regarding language of instruction (LOI) can have significant implications for children's development in ways that interact with poverty, parental literacy, ethnicity, and other risk factors faced by vulnerable children as they move through the formal education system.

| Child assessment in multilingual environments
Child development assessments allow teachers to understand how and what children are learning, to diagnose learning differences or language disorders, and to benchmark achievement against national or international standards (Armon-Lotem, de Jong, & Meir, 2015; Snilsveit et al., 2016). Similarly, researchers and policymakers rely on child assessments to examine programme effectiveness. In both academic and nonacademic settings, students are routinely tested in only one language, either the LOI or parents' preferred language. It is challenging to assess child development, language disorders, and school readiness in such populations, both because these children develop linguistic skills in multiple languages simultaneously and because most widely used measures of child development have not been validated in local languages and low- and middle-income country (LMIC) contexts. Many assessments created and validated in U.S. or European samples do not demonstrate the same strong psychometric characteristics when applied in different settings (Fernald, Prado, Kariger, & Raikes, 2017). To capture the linguistic development of children in LMIC contexts, it is crucial to adapt, or develop, and subsequently validate assessments in children's mother tongues (Prado et al., 2018).
Child development assessments conducted in a single language may not fully reflect a multilingual child's developmental outcomes and learning trajectory (Cummins, 1979, 2001; Peña, Bedore, & Kester, 2015). Bilingual children's conceptual vocabularies are similar in size to those of monolingual children; however, their vocabulary size in each language is smaller than that of monolingual children (Bialystok, Luk, Peets, & Yang, 2010; Hammer et al., 2014). The amount of overlap in children's vocabulary between the two languages may depend on how typologically related the two languages are (Hammer et al., 2014).
Furthermore, bilingual children's performance on language assessments in their second language may have more to do with exposure to the second language than knowledge transfer based on first-language proficiency (Keller, Troesch, & Grob, 2015). For this reason, children may perform better on certain aspects of the tests, such as letter sounds, syllables, or reading fluency, when they are tested in the LOI as compared to their native language (Bialystok, Majumder, & Martin, 2003). Greater reading fluency or decoding skills in the LOI, however, do not necessarily indicate that children have greater reading comprehension in the LOI (Piper, Schroeder, & Trudell, 2016; Piper, Zuilkowski, & Ong'ele, 2016). For multilingual children, assessment of language and other domains of development should account for all of the child's languages (Pearson, Fernandez, & Oller, 1993). Furthermore, the assessment should ideally capture the complexity of the child's language environment, or the extent to which a child's language is specific to a certain context (i.e. school, home, or community) (Pearson et al., 1993; Toppelberg & Collins, 2010). To date, most studies of bilingual or multilingual child language development have been conducted in high-income countries (Barac, Bialystok, Castro, & Sanchez, 2014), although a few studies have been conducted in sub-Saharan Africa (e.g. Alcock, 2017; Alcock & Alibhai, 2013; Alcock, Holding, Mung'ala-Odera, & Newton, 2008; Cockcroft, 2016; Demuth, 2003; Potgieter & Southwood, 2016). Thus, limited data are available to help us understand young children's verbal development in LMIC contexts.

| Current approaches to child assessment across contexts
There is an inherent tension between the desire to employ widely used, well-validated measures and the need to adapt items to local contexts.
Assessments that are well validated in one context but not appropriately adapted for another may not maintain their properties (Peña, 2007) and may perform unreliably (Gibson, Jamulowicz, & Oller, 2017). This problem is particularly pronounced for tests designed and validated in high-income countries that, without thorough and careful adaptation, often generate items poorly suited to an LMIC context (Fernald et al., 2017; van de Vijver & Poortinga, 2005). Investigators generally have four approaches when using a measure in a new country or context: adoption (translation of an existing test without modification); adaptation (translation with careful modification of items, responses, and administration); expansion (adding items to an existing test to suit a particular cultural or linguistic context); or creation of new tests (Figure 1).
Research Highlights

• This study measured vocabulary among Kenyan children 2-6 years old, at two time points, across three languages: Luo (mother tongue), Swahili and English (official languages).
• During testing, the youngest children strongly preferred to express themselves in Luo, whereas older children were more likely to respond in Luo and English.
• Luo receptive vocabulary among all children at baseline was significantly associated with English receptive vocabulary at follow-up, even accounting for baseline English and Swahili.
• Baseline caregiver literacy in Luo, rather than English, was robustly related to children's later receptive vocabulary in English.

These approaches have been used in the LMIC context (He & van de Vijver, 2012; Weber, Fernald, Galasso, & Ratsifandrihamanana, 2015) and in higher income contexts, where the parallel design of assessments is necessary to simultaneously test children's verbal development across multiple languages (Haman, Łuniewska, & Pomiechowska, 2015). When multiple tests are needed to comprehensively measure various capacities, a more diversified strategy may be to adopt some tests, adapt others, expand an existing test to include new test items, and create new tests that are internally valid for the context.

| Language policy in the African context
Over 2,149 mother tongue languages are spoken in Africa (Lewis, Simons, & Fennig, 2016), and more than a quarter of the African population speaks a native language that is not in official use in the educational system or by the government (Figure 2; Lewis et al., 2016). In spite of UNESCO's recent call for at least 6 years of mother tongue education (UNESCO, 2017), there are several reasons for resistance to mother tongue instruction. For example, parents and teachers sometimes believe that children who learn in the mother tongue language will fall behind those who learn in English (Jones, 2012; Trudell, 2007). In addition, linguistically appropriate teaching materials are not always available (Musau, 2003; Waithaka, 2017), and teachers may not be fluent in the local mother tongue (Manyonyi, Mbori, & Okwako, 2016; Trudell & Piper, 2014). The misalignment between children's first languages and those used in schools has important implications for the assessment of school readiness and learning outcomes: namely, children from linguistically marginalized families risk being underserved by the educational system.

| Current study
Our study took place in the Luo-speaking region of western Kenya, a country with 68 spoken languages (Lewis et al., 2016). English and Swahili are the official languages (i.e. for all government proceedings and publications), but literacy rates in these languages, while perhaps relatively high within sub-Saharan Africa, are still quite low. For example, only 55% of mothers of school-aged children, and about 51% of children aged 7-13 years can read English at a second-grade level (Uwezo, 2015). In our study area, only 31% of young primary school students are taught in Luo, while the rest are taught in either English or Swahili (Piper & Miksic, 2011).
The purpose of this study was to compile a set of child development assessments to evaluate the effects of a literacy promotion programme on multilingual children's development. Our first aim was to validate language assessments for children aged 2-6 years.
Our second aim was to understand children's performance on receptive vocabulary assessments in the mother tongue (Luo) and official languages (English and Swahili), and the extent to which scores on each of these assessments were associated with children's receptive and expressive vocabulary at a 5- to 6-week follow-up. We hypothesized that baseline scores in all languages, but especially English, would be significantly associated with child English receptive vocabulary at follow-up, as they all measure aspects of language skill. Our final aim was to examine the relationship between caregiver literacy (in both the mother tongue, Luo, and the LOI, English) and child receptive and expressive vocabulary, and to test whether the strength of the association between mother tongue and LOI vocabulary varied with caregiver literacy. We focused primarily on children's English vocabulary at follow-up as an indicator of school readiness, as English is the LOI at higher grade levels and the de facto language of instruction for many young children in our study area. We hypothesized that caregiver literacy in both languages at baseline would be significantly associated with child receptive and expressive language at follow-up, and that the association between the baseline measure of child receptive vocabulary in Luo and English vocabulary at follow-up would be strongest among children whose parents had lower English literacy.

F I G U R E 1 Proposed strategies for measuring early child development in Western Kenya, with examples from the study

| Study design and sample description
The measures described in this paper were developed for an ongoing cluster-randomized trial in Kenya's Kisumu and Homa Bay Counties that is designed to evaluate the effectiveness of a book distribution and parenting training programme on child development (see trial registry: https://doi.org/10.1186/ISRCTN68855267 and pilot results: Knauer, Jakiela, Ozier, Aboud, & Fernald, in press). Families with at least one child between the ages of 24 and 83 months were recruited from a set of nine primary school catchment areas in rural communities within two hours' drive from Kisumu. A total of 357 primary caregivers (one per household) and 510 children were assessed during household visits (average 1.43 children per household); five child assessments were incomplete, resulting in an analysis sample size of 505. A total of 442 children were assessed at follow-up (5-6 weeks later), with 68 children lost due to relocation or difficulty in making contact.

| Overview of child assessments
To develop our test battery, we used adoption, adaptation, expansion, and creation of new tests for different developmental domains.
All assessments were translated to Swahili or Luo and then back-translated to English by a different team of translators (two for each language) who did not have access to the original measure. The first and second authors (HK and PK) then met with a group of translators and discussed each translation to ensure that words conveying the desired meaning were chosen over direct translation (in Swahili and Luo, several words were often possible, depending on the intent of the item). The assessments were then pretested, and any additional study team concerns or discrepancies were addressed. Items for the vocabulary assessments were ordered by difficulty, as measured in a small pilot sample (between 30 and 61 respondents).
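The difficulty-ordering step can be sketched in code. This is an illustrative reconstruction, not the study's software; the item names and pilot outcomes below are hypothetical.

```python
def order_items_by_difficulty(pilot_responses):
    """Order test items from easiest to hardest using pilot pass rates.

    pilot_responses: dict mapping item name -> list of 0/1 pilot outcomes.
    Returns item names sorted by descending pass rate (easiest first).
    """
    pass_rates = {
        item: sum(outcomes) / len(outcomes)
        for item, outcomes in pilot_responses.items()
    }
    return sorted(pass_rates, key=pass_rates.get, reverse=True)
```

Ordering items from easiest to hardest is what makes the discontinue rules described below meaningful: once a child begins failing, subsequent (harder) items are unlikely to be passed.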
F I G U R E 2 Proportion of population speaking a native language used in any official capacity. Notes. Figure shows the proportion of the population whose native language is classified as an "institutional" language in the Ethnologue (Lewis et al., 2016). Institutional languages include national and provincial languages (used in government), languages other than national and provincial languages that are used in institutional education, and languages used for "wider communication" through mass media.

The assessors hired to administer the tests in the current study had university degrees, were from the study area, spoke Luo as their mother tongue, and were trained on the full battery of tests by the first and second study authors. On a subset of 48 children, two assessors double coded the baseline assessment to assess interrater reliability (IRR) for each of the assessments (Table S1).

| Receptive vocabulary
We created receptive vocabulary assessments based on the British Picture Vocabulary Scale III (BPVS III) (Dunn, Dunn, & Styles, 2009), which includes 168 items for individuals 3-16 years old (see details of translation and adaptation in Appendix A). Knowledge of words is measured by asking the respondent to point to one of four pictures that corresponds to a word (object, person, or action) spoken by the assessor. The BPVS has been adapted for use in South Africa (Cockcroft, 2016) and Indonesia (Prado, Alcock, Muadz, Ullman, & Shankar, 2012) and is the British adaptation of the Peabody Picture Vocabulary Test (Dunn & Dunn, 1997), which has also been used in neighboring areas of Kenya (Ozier, 2018). As we wanted to capture young children's knowledge of Luo, Swahili, and English words, we created three sets of nonoverlapping words of varying difficulty, with 27 Luo items, 32 Swahili items, and 34 English items. Administration ended when a child failed six out of a set of eight items.
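The discontinue rule can be sketched as follows. This is a hypothetical implementation under one plausible reading of the rule (fixed, consecutive sets of eight items, checked at the end of each set); the actual administration was done by assessors on paper.

```python
def administer_receptive(responses, set_size=8, fail_limit=6):
    """Apply the discontinue rule: stop when a child fails at least
    `fail_limit` items within a consecutive set of `set_size` items.

    responses: list of 0/1 outcomes, ordered from easiest to hardest item.
    Returns (total_correct, items_administered).
    """
    correct = 0
    administered = 0
    fails_in_set = 0
    for i, r in enumerate(responses):
        administered += 1
        if r:
            correct += 1
        else:
            fails_in_set += 1
        if (i + 1) % set_size == 0:  # reached the end of a set of eight
            if fails_in_set >= fail_limit:
                break  # discontinue: six or more fails in this set
            fails_in_set = 0
    return correct, administered
```

Because items are ordered by difficulty, stopping after a heavily failed set approximates the score the child would have obtained on the full item list while keeping administration short.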

| Expressive vocabulary
We developed our own measure of expressive vocabulary after reviewing various expressive vocabulary tests and concluding that the stimulus words and/or pictures were not appropriate to the context (see details in Appendix B). The assessment was a picture-naming task, in which children were presented with flash cards bearing a single illustrated stimulus item or object (noun) per card and were asked in the child's preferred language, "What is this?" for each item.
Children were not instructed as to which language to respond in, but responses in any language were accepted. We did not provide further instruction because code-switching during conversation is common in this area, and very young children may not be aware which language they are actually speaking for a given word. Thus, a child could respond to each item in the 20-item test in English, Luo, or Swahili to score a pass for expressing the word verbally.
Administration ended with three consecutive fails.
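The expressive task's scoring and discontinue rule can be sketched in the same style, again as a hypothetical illustration rather than study code: a response in any of the three languages counts as a pass, so the input is already a 0/1 list, and administration stops at three consecutive fails.

```python
def administer_expressive(responses, max_consecutive_fails=3):
    """Score the 20-item picture-naming task.

    responses: list of 0/1 outcomes; a correct name in Luo, Swahili, or
    English all count as 1, reflecting the any-language scoring rule.
    Returns (total_correct, items_administered).
    """
    correct = 0
    consecutive_fails = 0
    administered = 0
    for r in responses:
        administered += 1
        if r:
            correct += 1
            consecutive_fails = 0  # a pass resets the fail counter
        else:
            consecutive_fails += 1
            if consecutive_fails == max_consecutive_fails:
                break  # discontinue after three consecutive fails
    return correct, administered
```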

| Other child-level assessments
The Malawi Developmental Assessment Tool (MDAT) was created and validated for use in rural Malawi with children 1-84 months of age (Gladstone et al., 2010). It includes four 34-item subscales (fine motor/perception, language/hearing, gross motor, social-personal), with many items adapted from existing Western tests (see details of our adaptation in Appendix C). The MDAT is currently being used in various countries, including Mali, Sierra Leone, Rwanda, Burkina Faso, and Zimbabwe (M. Gladstone, personal communication, June 24, 2016).
The western Kenya adaptation was initiated by the first, second, and fifth authors (HK, PK, and LCHF) for the Kenya Life Panel Survey (e.g. Baird, Hicks, Kremer, & Miguel, 2016), a longitudinal study that examines the intergenerational effects of health investments. We used the translations and piloting data from that study to further adapt and expand the language and fine motor/perception subtests of the MDAT for this study. The final adapted language test had 26 items. To further reduce the overall length of the test, we created start and stop rules for three different age groups (24-35 months; 36-59 months; 60-71 months) based on pass rates during piloting.

| Caregiver survey
Data were gathered on household assets, housing quality, household size and composition, and the age and education level of primary caregivers. In addition, we assessed caregiver literacy by asking caregivers to read a simple, five-word (second-grade level) sentence in each language, adapted from the Early Grade Reading Assessment (EGRA; Gove & Wetterberg, 2011). Caregivers who read more than one word incorrectly in all three languages were categorized as illiterate. Working memory in caregivers was assessed using a summary score of the forward and backward digit span test (Ozier, 2018).
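The literacy classification rule can be made concrete with a short sketch. The function below is a hypothetical rendering of the rule as stated: a caregiver is treated as literate in a language if at most one of the five words was misread, and as illiterate overall only if they miss more than one word in every language.

```python
def classify_caregiver_literacy(words_correct):
    """Classify caregiver literacy from the five-word EGRA-style sentences.

    words_correct: dict mapping language -> number of words (out of 5)
    the caregiver read correctly in that language.
    Returns (per-language literacy dict, overall-illiterate flag).
    """
    # Literate in a language: no more than one of five words misread.
    literate_in = {lang: n >= 4 for lang, n in words_correct.items()}
    # Illiterate overall: more than one word wrong in all three languages.
    illiterate = not any(literate_in.values())
    return literate_in, illiterate
```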

| Statistical analysis
To address the first aim of validating our assessments by examining their psychometric properties, we measured: (a) the internal consistency of the measures using Cronbach's alpha; (b) IRR using Cohen's kappa, Krippendorff's alpha, and percent agreement; (c) construct validity by examining the correlations between the measures; and (d) convergent validity by examining associations with known covariates in bivariate regressions. For our second aim, to better understand the relationships between baseline measures of mother tongue and LOI receptive vocabulary and scores on subsequent vocabulary assessments, we estimated a series of ordinary least squares (OLS) regression models to examine the associations between baseline age-standardized receptive vocabulary scores in all languages (English, Swahili, and Luo) and English receptive vocabulary at follow-up. We repeated this analysis for follow-up measures of child expressive vocabulary as well as Swahili and Luo receptive vocabulary; we present these results as supplemental analyses. Our final aim was to examine the association between caregiver literacy and child vocabulary at two time points. We used OLS regression to examine the association between baseline caregiver literacy in Luo and English and child English and Luo receptive and expressive vocabulary scores at follow-up. To test whether the relationship between baseline mother tongue and LOI receptive vocabulary scores and follow-up LOI receptive vocabulary varied with caregiver literacy, we estimated OLS regression models that included both caregiver literacy and baseline child receptive vocabulary (in English, Luo, and Swahili). All regressions used age-adjusted z-scores for child vocabulary, and standard errors were adjusted for household clustering. All statistical analyses were conducted using Stata 14.2.
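As an illustration of the internal-consistency check in (a), Cronbach's alpha can be computed directly from an item-by-respondent score matrix. This is a minimal sketch of the standard formula, not the study's analysis code (which was run in Stata).

```python
def cronbachs_alpha(item_scores):
    """Cronbach's alpha for a k-item test.

    item_scores: list of k per-item score lists, each of length n
    (one entry per respondent).
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals),
    using population variances throughout for consistency.
    """
    k = len(item_scores)
    n = len(item_scores[0])

    def pvar(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_var_sum = sum(pvar(item) for item in item_scores)
    totals = [sum(item[j] for item in item_scores) for j in range(n)]
    return k / (k - 1) * (1 - item_var_sum / pvar(totals))
```

Alpha approaches 1 when items rise and fall together across respondents (all item variance is shared) and falls toward 0 when total-score variance is no larger than the sum of item variances.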

| Descriptive statistics
The average age of children in the study was 54.42 months (range 24-83 months) (Table 1). About one-quarter (27%) of caregivers were illiterate. Maternal and household characteristics were similar to those observed in the representative 2014 Kenya Demographic and Health Survey sample for the study area.

| Psychometric properties of the instruments
The internal consistency of the vocabulary measures ranged from α = 0.57 to 0.90: Cronbach's alphas were lowest for the expressive and English receptive tests and highest for the MDAT language test (Table S1). The internal consistency of the receptive vocabulary assessments was higher for Luo (α = 0.78) and Swahili (α = 0.76) than for English (α = 0.57). The IRR of the receptive vocabulary tests was κ = 1 for Luo, κ = 0.89 for Swahili, and κ = 0.95 for English. The internal consistency of the expressive vocabulary test was α = 0.67, while the IRR was κ = 0.95. The internal consistency of the MDAT fine motor and language tests was α = 0.94 and α = 0.90, respectively. IRR of the total score for each measure was κ = 0.93 for fine motor and κ = 0.86 for language.

Correlations among the baseline child development assessments ranged from r = 0.32 to 0.56, and all correlations were statistically significant at the p < 0.001 level (Table S2). The three age-normalized receptive vocabulary scores were all moderately correlated with each other, with the expressive vocabulary score, and with the MDAT scores, while the expressive vocabulary score was also moderately correlated with both MDAT scores. The two MDAT tests had the strongest correlation (r = 0.56) with each other; among the vocabulary assessments, they had the strongest correlations with Luo vocabulary (r = 0.48-0.49).

The associations of baseline child, caregiver, and household characteristics with child age-adjusted child development scores are presented in Table S3. In bivariate regression analyses adjusted for household clustering, child height-for-age z-score was significantly associated with all child development assessments (β = 0.25-0.33, SE = 0.03-0.04, p < 0.001 for all). Caregiver characteristics (education, literacy, and cognition) were most strongly associated with child expressive vocabulary and MDAT scores, while caregiver depressive symptoms were not associated with any child assessments.
Finally, household characteristics were not consistently associated with child assessments.
Overall, 189 children answered the expressive vocabulary test entirely in Luo, while 13 children answered entirely in English, and 6 children answered entirely in Swahili (Table 2). The other 297 children (58%) answered in more than one language; the number of children answering in only one language decreased with age. Across all ages, children answered more expressive vocabulary words in Luo, followed by English and then Swahili. The fraction of expressive responses given in Luo decreased from about 89% among 2-year-olds to about 67% among 6-year-olds, while the fraction of responses given in English increased from about 5% among 2-year-olds to about 29% among 6-year-olds (Table 2). The percentage of responses given in Swahili was small (7% among 2-year-olds) and decreased slightly with age. The youngest children showed a clear preference for expressing themselves in their mother tongue, as was evident in the patterns of response in our expressive vocabulary test (Figure 3).

| Children's vocabulary at baseline and followup
In bivariate analyses, the baseline measure for each language was most strongly associated with the corresponding follow-up measure (Tables 3 and S4).

| Associations between caregiver literacy and child vocabulary
In our final analyses, we examined caregiver baseline and child follow-up measures in English (the primary LOI at older grade levels) and Luo (the mother tongue for 95% of our sample). After adjusting for caregiver education and household wealth, caregiver literacy in Luo was significantly associated with children's receptive vocabulary in English (β = 0.11, SE = 0.05, p = 0.045), while caregiver literacy in English was not (Table 4). Caregiver literacy in either language was not significantly associated with children's receptive vocabulary in Luo or their expressive vocabulary. Moreover, controlling for caregiver literacy (in English and Luo) did not alter the pattern of associations between children's baseline receptive vocabulary (in English, Luo, and Swahili) and English receptive vocabulary at follow-up (Table S7).

| Discussion
In this study, we assessed the language development of 2-to-6-year-old multilingual children at two time points in a rural, ethnically homogeneous region of Kenya. Notably, we found that [...] Luo (Piper, Schroeder, et al., 2016). While children's familiarity with English through their classroom exposure is high, their actual understanding of English is often quite low (Trudell & Piper, 2014). This situation is likely common to many African contexts, since many children learn to read in a language other than their mother tongue (Lewis et al., 2016).

Notes: English, Swahili, and Luo receptive vocabulary are measured using three assessments based on the British Picture Vocabulary Scale (BPVS). Vocabulary scores are age-adjusted z-scores for children ages 2-6 years. Baseline and follow-up were conducted approximately 5 weeks apart. *p < 0.1. **p < 0.05. ***p < 0.01. ****p < 0.001.
In the process of vocabulary development, children typically first acquire receptive knowledge of a word (i.e. they recognize and understand the word when it is spoken or read), only later developing the ability to produce the word (expressive vocabulary) by speaking or writing it (Burger & Chong, 2011). By age six, children's receptive vocabulary is usually larger than their expressive vocabulary, although they may also learn to say words before they fully understand them (Burger & Chong, 2011). In our examination of the relationship between children's receptive and expressive vocabulary, we found that the strongest measures of language development at follow-up were baseline expressive vocabulary (in any language), followed by receptive vocabulary in Luo. However, expressive vocabulary is often not measured in research studies in LMICs, for example, because very young children can be too shy to respond, or may respond correctly in any one of several languages, which makes responses more complicated to code. Our finding that 59% of children used multiple languages in their expressive responses confirmed our assumption that code-switching was common.

F I G U R E 4 Receptive vocabulary test performance, by language and child age

F I G U R E 5 Percentage changes in R-squared relative to the test-retest specification. Notes. Figure depicts changes in R-squared relative to a regression of the follow-up receptive measure for each language on the baseline measure of the same language. The dark bars show that regressing follow-up English receptive vocabulary on all three baseline languages yields a 37% increase in R-squared over using baseline English alone, while the other languages gain less than 10%. The lighter bars show that regressing follow-up English receptive vocabulary on only the next-most-strongly associated baseline language (besides English itself) reduces R-squared by only 24%, while dropping to the next-best language reduces R-squared by more than 50% for follow-up measures of languages other than English.

TA B L E 4 The association of baseline caregiver literacy in English and Luo with child follow-up vocabulary scores. Notes: Receptive English and Luo vocabulary scores are age-adjusted z-scores for children ages 2-6 years, measured using separate assessments based on the British Picture Vocabulary Scale (BPVS). Expressive vocabulary z-scores were measured using a tool developed from the PPVT. Caregiver literacy is the number of words (out of 5) a caregiver could read from a simple sentence at a second-grade reading level, adapted from the Early Grade Reading Assessment (EGRA). Baseline and follow-up were conducted approximately 5 weeks apart. The first two models for each vocabulary assessment (cg1 & cg2, cg5 & cg6, cg9 & cg10) are bivariate regressions. The third model for each vocabulary assessment (cg3, cg7, and cg11) includes caregiver literacy in both languages. The fourth model for each vocabulary assessment (cg4, cg8, and cg12) adds controls for caregiver education and household wealth. *p < 0.1. **p < 0.05. ***p < 0.01. ****p < 0.001.
Caregivers' literacy in the mother tongue at baseline provided an indicator of children's school readiness (as measured by English vocabulary), while caregivers' English literacy skills at baseline did not. A central limitation of this study is that it took place among a rural and ethnically homogeneous group of children, so the findings may not generalize to urban or ethnically mixed settings (Hungi, Njangi, Wekulo, & Ngware, 2017). In mixed-ethnicity households or very diverse communities, the associations between mother tongue vocabulary and subsequent LOI vocabulary may not be as strong. However, even within this homogeneous group, we had to navigate a multilingual environment to implement language assessments, which presented several inherent challenges. First, items (e.g. "playground") that perform well in high-income contexts may be unknown to children in other settings. Additionally, concepts represented by a single, more difficult word in the original test language may translate to a phrase built from much simpler words: for example, "nest" translates in Luo to od ("covering" or "housing") winyo ("bird") (Capen, 1998), making it an easier word in Luo than in English; thus, the original ordering of item difficulty may no longer be appropriate. Finally, even linguistically accurate translations may not retain what some have called "psychological similarity" (van de Vijver & Poortinga, 2005): an item taken from one setting may not carry the same psychological meaning in a different context, such as "What do we do before crossing the road?" Therefore, a core strength of our study is the rigorous adaptation, translation, and validation process that we performed for our assessments and our testing of children across a broad age range in multiple languages.
This process allowed us to document more fully how children's vocabulary in different languages evolves with age and how receptive vocabulary measures in mother tongue and the LOI were associated with vocabulary development 5-6 weeks later.
In a multilingual context, as is common in LMICs, there is a question of how best to support young children's language and cognitive development. Should pre-primary educational materials be in the local mother tongue-i.e. children's first language-or in English, the language in which children will eventually be instructed and tested in primary school? Our findings raise the possibility that to best support the language development of children before school age, early childhood interventions-especially those targeting parents-might do well to include instruction and materials in mother tongue, as a child's first language lays the foundation for learning in other languages and for general readiness for school (see also Altan & Hoff, 2018;Hoff & Ribot, 2017).

A recent review of language of instruction policies in Eastern and Southern Africa found that 14 of 21 countries introduce English as the LOI before fifth grade (Trudell, 2016a). However, it may be particularly challenging in Africa to implement UNESCO's guideline of at least six years of mother tongue education because of the continent's high degree of linguistic heterogeneity. As a concrete example, Kenya's formal educational policy mandates that early primary instruction be conducted in the mother tongue in rural areas and in Swahili in urban areas, with a transition to English at Grade 4 in either case. In practice, however, this policy is only loosely followed, illustrating the challenges inherent in such linguistically complex environments (Manyonyi et al., 2016; Trudell, 2016b).
Vocabulary assessment of young children in only one language, particularly if not in their mother tongue, risks inadequately capturing children's development. Foundational work in the study of bilingual education has pointed out the interdependence of language skills across languages for bilingual children, but has focused exclusively on high-income country examples (e.g. Cummins, 1979).
Monolinguals and bilinguals may learn school-centric words in the LOI equally quickly, but bilingual children may differentially know home-centric words in their first language rather than the LOI, thereby complicating the interpretation of assessments conducted in a single language (Bialystok et al., 2010). As a specific example of the interplay of languages in Africa, Shin et al. (2015) found that in Malawi, Chichewa literacy in Grade 2 was a predictor of subsequent English skills in Grade 3.
A recent study in Kenya found no additional benefit from mother tongue instruction in primary school on children's language development, but only assessed children's linguistic development in English and Swahili (Piper, Zuilkowski, Kwayumba, & Oyanga, 2018). However, a separate study found that the PRIMR programme (which provides teacher training and instructional supports to improve language and math skills in early primary grades) improved oral reading fluency and reading comprehension in mother tongue (Piper, Zuilkowski, et al., 2016). Our findings suggest that receptive vocabulary in a child's mother tongue may be a particularly important measure of linguistic development, even when the outcome of interest is the language of instruction. Children's vocabulary in their mother tongue may better reflect the level of stimulation and conversation they receive at home, while children's vocabulary in the LOI indicates their exposure to that language. Multilingual testing of parents and children is essential in order to understand the developmental status of multilingual children as well as factors that affect their development in LMICs.

ACKNOWLEDGMENTS
We thank Sheyda Esnaashari, Saahil Karpe, Rohit Chhabra, and Emily Cook-Lundgren for research assistance throughout this project. We thank the staff of Innovations for Poverty Action (Kenya), specifically, Patricia Gitonga, Michael Meda, Jessica Jomo, and the field team they led during data collection for this project. We are especially grateful to the KLPS team, who generously shared their previously adapted instruments with us. This work was supported by The World Bank, Washington, DC (via three facilities: the Strategic Impact Evaluation Fund, the Early Learning Partnership, and the Research Support Budget). The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily reflect the views of The World Bank, its board of executive directors, or the governments they represent.

CONFLICT OF INTEREST
The authors have no conflicts of interest to declare.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available upon reasonable request to oozier@worldbank.org.

APPENDIX A Receptive vocabulary test creation
This process produced the following results. For 9 of the stimulus words, the translation to Luo and Swahili resulted in the same word (for example, "money" is "pesa" in both languages; "airplane" is "ndege" in both); 26 words had no commonly known directly translated equivalent in Luo among the adult speakers interviewed; 12 words had no equivalent in Swahili, and 11 had no translation in either Luo or Swahili; 13 Luo and 10 Swahili translations were not single words, but phrases or words with a qualifier (e.g., "gigantic" was translated as "very big" in Swahili); 5 Swahili and 4 Luo words were appropriated from English and were identical or nearly identical to the English word; and for 12 stimulus items, the words were suitable for piloting but the pictures were inappropriate or unfamiliar. Eight plates had words that were more likely to be known in English and were not translated. Luo was the most difficult language to work with, as it had the most limitations in translating the stimulus words, so we created the Luo list first. To select the best words possible, we (PK, OO) engaged a focus group of six Luo-speaking mothers and teachers to review 32 words for which we either had no suitable translation or that were candidate items not part of the original BPVS III.

We piloted 58 items with about 30 Luo-speaking children 2-6 years of age and examined pass rates for each word by age group (younger or older than five years). Words were then grouped by estimated difficulty level (hard, average, or easy), and we sought to have roughly the same number of items at each of the three difficulty levels. As we lacked a sufficient number of Luo words across the three categories, we created some new stimulus words based on results from the focus groups (lantana, bull, roar), providing our own plates of pictures for each. The final Luo test includes 27 items, with approximately nine words at each difficulty level. Of these 27 items, three were new words and pictures; three involved two-word translations (no one-word translation was known); one used a distractor picture to replace the original stimulus picture ("boulder" replaced "mountain"); one replaced a plate with more familiar-looking pictures; and one slightly changed the stimulus word (from "applauding" to "clapping," as our translators knew of no distinct word for "applauding").
We repeated a similar process to create tests in Swahili (32 items) and English (34 items), each with roughly one third of items in each of the three difficulty categories. The Swahili test included two items altered by changing the stimulus word to match a distractor picture, and one item with the stimulus word slightly altered (from "sawing" to "cutting"). Nine items were changed for the English test: five stimulus words were changed to better reflect the English words used for the stimulus picture (e.g., "zipper" was changed to "zip"; "sedan" was changed to "saloon car"); three distractor pictures replaced original stimulus items; and a new item was introduced using an existing plate ("thumb"). With our final set of words, we tabulated the rate of correct responses for each item, then sorted the items in descending order by that rate.
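The tabulation step above (pass rates per item, sorted descending, then binned into three difficulty levels) can be sketched as follows. The data frame, column names, and example items are hypothetical, and the equal-thirds cutoffs for the difficulty bins are illustrative assumptions, not the study's actual thresholds:

```python
import pandas as pd

# Hypothetical pilot responses: one row per (child, item) trial.
pilot = pd.DataFrame({
    "item":    ["nest", "nest", "bull", "bull", "roar", "roar", "money", "money"],
    "age":     [3, 6, 2, 5, 4, 6, 3, 5],
    "correct": [0, 1, 1, 1, 0, 0, 1, 1],
})

# Pass rates split by age group (younger vs. older than five), as in the pilot review
pilot["age_group"] = pd.cut(pilot["age"], bins=[1, 4, 6], labels=["2-4", "5-6"])
by_age = pilot.pivot_table(index="item", columns="age_group",
                           values="correct", aggfunc="mean", observed=True)

# Overall pass rate per item, sorted in descending order of correct responses
pass_rates = pilot.groupby("item")["correct"].mean().sort_values(ascending=False)

# Bin items into three difficulty levels over the pass-rate scale
difficulty = pd.cut(pass_rates, bins=[-0.001, 1/3, 2/3, 1.0],
                    labels=["hard", "average", "easy"])
```

Binning on the overall pass rate keeps the easy/average/hard groups directly comparable across the Luo, Swahili, and English item pools.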
Based on data collected from the full study population, we used item response theory (IRT) to assess the content validity of the receptive vocabulary measures. This analysis allowed us to understand the relative difficulty and discrimination of the items.
Note: Estimated using one-parameter and two-parameter logistic IRT models. The first parameter is item difficulty, and the second is item discrimination. One-parameter models estimate an overall discrimination that is held constant across items; two-parameter models allow item discrimination to vary across items. We used a likelihood ratio test to compare the two models and determine the better fit. A two-parameter model would not converge for the English receptive vocabulary, so only item difficulties were estimated.
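The one- versus two-parameter comparison can be sketched as below. This is a minimal illustration on simulated data, using joint (fixed-effects) maximum likelihood rather than the marginal estimation typical of IRT software; all names and sample sizes are invented for the example:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

rng = np.random.default_rng(0)
n_children, n_items = 200, 10

# Simulated binary response matrix (1 = correct), standing in for the real item data
theta = rng.normal(size=n_children)              # latent ability
b_true = np.linspace(-1.5, 1.5, n_items)         # item difficulty
a_true = rng.uniform(0.5, 2.0, size=n_items)     # item discrimination
z_true = a_true * (theta[:, None] - b_true)
responses = (rng.random((n_children, n_items)) < 1 / (1 + np.exp(-z_true))).astype(int)

def negll(params):
    """Joint negative log-likelihood of the logistic IRT model.

    The trailing slope block has length 1 (one shared discrimination, 1PL)
    or n_items (per-item discrimination, 2PL); broadcasting handles both.
    """
    th = params[:n_children]
    b = params[n_children:n_children + n_items]
    a = params[n_children + n_items:]
    z = a * (th[:, None] - b)
    # Bernoulli log-likelihood: y*z - log(1 + exp(z)), summed over all responses
    return -(responses * z - np.logaddexp(0, z)).sum()

# 1PL: one overall discrimination held constant across items
x0 = np.concatenate([np.zeros(n_children + n_items), [1.0]])
fit_1pl = minimize(negll, x0, method="L-BFGS-B")

# 2PL: per-item discrimination, started at the 1PL solution so it can only improve
x0 = np.concatenate([fit_1pl.x[:-1], np.full(n_items, fit_1pl.x[-1])])
fit_2pl = minimize(negll, x0, method="L-BFGS-B")

# Likelihood ratio test: the 2PL adds (n_items - 1) free discrimination parameters
lrt_stat = 2 * (fit_1pl.fun - fit_2pl.fun)
p_value = chi2.sf(lrt_stat, df=n_items - 1)
print(f"LRT statistic = {lrt_stat:.2f}, p = {p_value:.4f}")
```

Because the 1PL is nested within the 2PL, the statistic is non-negative and a small p-value favors letting discrimination vary across items, which is the comparison the table note describes.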

APPENDIX B Expressive vocabulary test creation
To create the expressive vocabulary measure, we began with images drawn from pieces of the BPVS III plates, local storybook illustrations used in a related project, and other simple drawings. We presented up to 200 individual pictures to 61 children 2-6 years of age, asking them to name the object or concept the picture showed. We recorded responses in the language (English, Swahili, or Luo) children used.
We then reviewed pass rates (in any language) by age (younger or older than five years) for each picture and discarded words with no clear response (e.g., multiple different responses for the same picture). For all words that sounded similar across languages, the field team agreed on the differences in pronunciation that would identify a child's response in a given language.
The full expressive vocabulary assessment is provided below, with the intention of making the tool freely available for research purposes only. The tool is not meant for, and should not be used for, diagnostic purposes; we did not establish any norms. It is also not intended for use as an instructional aid. The images were printed on single-sided flashcards, one image per card.
Children were asked in their preferred language, "What is this?" for each item. Children were not instructed as to which language to respond in; responses in any language were accepted. We did not provide further instruction because code-switching during conversation is very common in this area, and very young children may not be aware which language they are speaking for a given word (Table 7).