Long-run effects of teachers in developing countries

How persistent are teacher effects on student outcomes? A large literature has established the effects of individual teachers on students in the United States (Hanushek & Rivkin, 2010; Koedel, Mihaly, & Rockoff, 2015). Fewer studies consider teachers in developing countries, and none in relation to longer-run outcomes. Estimates from the USA put the economic value of an effective teacher in the hundreds of thousands of dollars, based on future wage gains of their students (Chetty, Friedman, & Rockoff, 2014b). In this paper we estimate the persistent effect of teacher quality in Ethiopia and Vietnam. Different approaches have been taken to defining and measuring teacher quality. The “Measures of Effective Teaching” (MET) project in the USA directly compared three measures: using growth in student test scores, using lesson observations, and using student feedback. The project found that the best
Received: 2 April 2019 | Revised: 29 January 2020 | Accepted: 27 July 2020 | DOI: 10.1111/rode.12717


CRAWFURD AND ROLLESTON
while Vietnam compares favorably with high-income countries (see Figure 1). Vietnam ranks 27th of 157 countries on the World Bank harmonized learning outcome scale, with Ethiopia ranking 131st. In science, Vietnam ranked 8th of 72 mostly high-income countries in the 2015 Programme for International Student Assessment (PISA). Results from school-based tests such as PISA are biased estimates of population averages in lower-income countries due to higher dropout rates than in OECD countries. Nonetheless, Vietnam also performs well on the component of the Young Lives household-survey-based assessment that links to the international Trends in International Mathematics and Science Study (TIMSS) scale (Singh, 2014). Vietnam appears to have avoided a "quality-quantity trade-off" in expanding education provision (Rolleston, 2016), achieving high levels of learning for a majority of pupils and also low levels of inequality in test scores. There is also a relatively high number of high-scoring pupils from disadvantaged backgrounds (see OECD, 2016). These features point towards a context in which evidence on school and teacher effectiveness is potentially particularly informative.

| DATA
We use two sources of overlapping data from the Young Lives study. The first is a longitudinal study following the same households for five survey rounds over 15 years (the "household survey"). The second is a school-based study that examines the children from the household survey alongside their classmates and teachers over a single school year (the "school survey"). The overlap in the timing of the surveys is presented in Table 1.
FIGURE 1 GDP per capita and harmonized test scores. This figure presents results from the World Bank Harmonized Test Score database (Patrinos & Angrist, 2018)
The household survey has followed and administered questionnaires and assessments to 12,000 children over 15 years. The survey is divided into two birth cohorts. We focus on the younger cohort (born in 2001-2), many of whom are also observed in the school survey. We make use of Rounds 3 (2009-10), 4 (2012-13), and 5 (2016-17) of the household survey. The cohort is aged 8 in Round 3, and 15 in Round 5. Each household survey includes assessments in mathematics and reading. These assessments are not specific to country curricula and are suitable for international comparisons.
The school survey was conducted in 2011-12 in Vietnam, when children were in Grade 5, and in 2012-13 in Ethiopia, when children were in Grade 4 or 5. In Ethiopia the school survey covered 13,725 students in 92 schools, of whom 549 are followed in the household survey. This includes all students in the class of the household survey child. In Vietnam the survey covered 3,284 children in 56 schools, of whom 1,138 were followed by the household survey. In Vietnam this comprised a random sample of 20 students per class (including the student from the household survey). The school survey included assessments in mathematics and reading comprehension conducted at both the beginning and end of the school year. Further details are provided in Rolleston, James, Pasquier-Doumer, and Tran (2013).
In both countries, a sentinel-site sampling design is employed, comprising 20 purposively selected sites chosen to represent national diversity, but with a pro-poor bias. At the site level, children were selected randomly in 2001 to be representative of the birth cohort in each site (see Boyden & James, 2014, for full details). The sites in Ethiopia are located in five regions: Addis Ababa, Amhara, Oromiya, Tigray, and the Southern Nations, Nationalities, and People's Region (SNNP). The sites in Vietnam are in five provinces: Lao Cai, Hung Yen, Da Nang, Phu Yen, and Ben Tre. Each region or province sample contains four sites, and each site is formed of one or two woredas in Ethiopia or communes in Vietnam.
Our main outcome variables are the mathematics and reading comprehension test scores of students in the Round 5 household survey, administered during the household visits. That is 5 years after exposure to the "treatment" teacher. We also consider test scores in the Round 4 household survey (2012-13), 2 years after exposure. From both Round 4 and Round 5 we also include the Peabody Picture Vocabulary Test (PPVT), a measure of receptive vocabulary which is sometimes used as a more general cognitive development indicator. In this test, the interviewer presents to the child a series of pages that contain four pictures. The interviewer says a word and the child has to correctly identify the picture that best corresponds to the word. Further, we also employ a set of simple measures of non-cognitive skills, attitudes, or dispositions. Kraft (2019) shows that teachers have effects on both cognitive and non-cognitive skills, and Jackson (2018) shows that teacher effects on non-cognitive skills matter more for long-run outcomes than teacher effects on test scores. We add to this literature by measuring the long-run effects on student non-cognitive skills to test whether short-run effects are persistent. Subjective well-being is measured using a Cantril ladder in which students place themselves on a scale from 1 (worst) to 9 (best). "Grit" is measured as the mean response to four questions, each answered on a four-point Likert scale. To measure aspirations, we define a binary indicator for whether a student reported that they would like to complete university if they had no constraints. Finally, for Round 5 only, we calculate expected earnings at age 25, based on the average of what children reported as their expected maximum and minimum earnings at that age.
Our main treatment variable is the effectiveness of the teacher assigned in Grade 5, which is estimated using the test score data from the school survey. These scores come from school-administered multiple-choice tests in mathematics and reading comprehension that are specific to each country's curriculum.
We also have student test scores from before they entered the treatment teacher's classroom, along with other characteristics, from the Round 3 household survey data (2009). These comprise PPVT results and scores from a basic math test appropriate for age 8. Table 2 summarizes the student-level variables employed in the analysis.
The Vietnam school survey data includes 176 teachers. We drop 19 teachers from schools that have only one Grade 5 class, so that we can focus on within-school variation in teacher quality in order to limit potential biases arising from non-random selection of teachers into schools. The Ethiopia school survey includes 146 teachers. Teacher characteristics are summarized in Table 3.

| METHODS
To estimate the persistent effect of teacher quality we carry out a two-step procedure. We first estimate teacher quality using the school survey data that include test data for students at the start and end of a school year. We then link these estimates of teacher quality to later student performance.

| Estimating teacher quality
Note: Household wealth is a standardized (to mean 0 and standard deviation 1) asset index based on the first principal component of a list of assets including a phone, radio, TV, bike, car, motorbike, table, chair, fridge, electricity, and water. Higher education is a postgraduate degree in Vietnam and a post-secondary diploma or higher in Ethiopia. Self-confidence is a standardized (to mean 0 and standard deviation 1) index based on the first principal component of responses to a series of statements on self-efficacy, such as "I can get through to the most difficult students" and "I can get students to work well together". Student wealth is a standardized (to mean 0 and standard deviation 1) index of the average socioeconomic status of all individual students in their class (each being based on the first principal component of a list of household assets).
We estimate teacher quality with a standard student learning production function, following Todd and Wolpin (2003) and Chetty et al. (2014a). To estimate teacher effects using the school survey, we model student learning outcomes as a function of their (unobserved) ability, and all present and past individual, family, and school inputs. Lagged test scores act as a summary proxy indicator for all observed and unobserved inputs up to the point of that test so that we use a "value-added" framework. Moreover, by additionally controlling for contemporaneous household inputs, we can interpret any remaining changes in scores between teachers as due to that teacher. We estimate a lagged dependent variable ordinary least squares (OLS) value-added model given by
$$y_{i,t} = \alpha\, y_{i,t-1} + \gamma X_i + \tau_j + S_s + \varepsilon_{i,t} \qquad (1)$$
Student test scores at the end of the school year ($y_{i,t}$) are regressed on lagged test scores from the start of the school year ($y_{i,t-1}$), a rich set of student characteristics ($X_i$) (age, gender, asset index, ethnicity, boarding status, and number of meals eaten per day), dummy variables for individual teachers ($\tau_j$), and school fixed effects ($S_s$). As these survey-based data sets only include one cohort of children, it is impossible to distinguish between classroom and teacher effects.
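As an illustration of the value-added model in Equation 1, the following minimal sketch estimates teacher effects from start- and end-of-year scores. It is a simplification and not the authors' code: it assumes a single covariate (the lagged score), uses within-teacher demeaning, which yields the same slope as OLS with a full set of teacher dummies, and expresses each teacher effect relative to the unweighted mean of teachers in the same school. The function name, record layout, and toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def teacher_value_added(records):
    """records: iterable of (teacher_id, school_id, lagged_score, end_score)."""
    by_teacher = defaultdict(list)
    teacher_school = {}
    for teacher, school, lag, end in records:
        by_teacher[teacher].append((lag, end))
        teacher_school[teacher] = school

    # Within-teacher demeaning: the slope on the lagged score is identical
    # to the one from OLS with a dummy variable for every teacher.
    num = den = 0.0
    means = {}
    for teacher, rows in by_teacher.items():
        lag_bar = mean(lag for lag, _ in rows)
        end_bar = mean(end for _, end in rows)
        means[teacher] = (lag_bar, end_bar)
        for lag, end in rows:
            num += (lag - lag_bar) * (end - end_bar)
            den += (lag - lag_bar) ** 2
    alpha = num / den

    # Raw teacher effect: mean end score net of the predicted part.
    raw = {t: end_bar - alpha * lag_bar for t, (lag_bar, end_bar) in means.items()}

    # Centre within school so only within-school variation remains
    # (an assumed normalisation standing in for school fixed effects).
    by_school = defaultdict(list)
    for t, s in teacher_school.items():
        by_school[s].append(raw[t])
    effects = {t: raw[t] - mean(by_school[teacher_school[t]]) for t in raw}
    return alpha, effects


# Hypothetical toy data: two schools, two teachers each, three students per teacher.
true_effect = {"A": 0.2, "B": -0.2, "C": 0.1, "D": -0.1}
school_of = {"A": "s1", "B": "s1", "C": "s2", "D": "s2"}
school_shift = {"s1": 0.5, "s2": -0.5}
records = [
    (t, school_of[t], lag, 0.5 * lag + true_effect[t] + school_shift[school_of[t]])
    for t in true_effect
    for lag in (-1.0, 0.0, 1.0)
]
alpha, effects = teacher_value_added(records)
print(alpha, effects)
```

Because the toy data contain no noise, the recovered effects match the assumed ones; with real data each estimate carries sampling error, which is why a shrinkage adjustment is typically applied afterwards.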
Causal interpretation of estimated teacher effects $\tau_j$ (teacher quality) is impeded by the possibility of reverse causality: where there is more than one classroom per grade in a school, the allocation of students may be made by ability or aptitude ("tracking," "streaming" or "setting"). This does not appear to be a substantial issue in our data, however, since only a very small percentage of teachers report that classrooms are grouped by ability, with the majority being allocated quasi-randomly. Allocation of teachers to groups of different abilities might also be non-random. Teachers might also choose particular teaching methods because of their students' abilities. However, controlling for lagged achievement should substantially address this concern. The "value-added" framework allows us to interpret results in terms of effects on student progress over a single school year, conditional on their starting points. Test scores from the beginning and end of the school year are calibrated concurrently using models based on item response theory so that they are directly comparable and reported on the same scale. This is possible given that the two tests contain a number of common (anchor or link) items (see Rolleston et al., 2013).
A range of specifications have been used in estimating teacher value-added in the literature. This includes two principal approaches: first, including a full set of teacher dummy variables; and second, a two-stage procedure which estimates the regression model at the student level without teacher dummies, and then averages student residuals by teacher. We prefer the full dummy set approach as it explicitly partials out any student-level covariates from the teacher effects, which controls to some extent for non-random placement of students into classrooms (demonstrated by Guarino, Reckase, Stacy, & Wooldridge, 2015, using simulated and real data).
We account for sampling variation in the estimated teacher effects by applying a Bayesian shrinkage estimator (Aaronson, Barrow, & Sander, 2007). This "shrinks" estimates towards zero for teachers with small numbers of students, whose imprecisely estimated effects would otherwise be more likely to take extreme values.
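The idea of shrinkage can be sketched as follows. This is a generic empirical Bayes adjustment written under stated assumptions, not necessarily the exact procedure of Aaronson et al. (2007): each raw estimate is pulled towards the overall mean in proportion to its estimated reliability, where the signal variance is taken to be the variance of the raw estimates minus the average squared standard error. The function name and the example numbers are hypothetical.

```python
from statistics import mean, pvariance


def shrink_effects(raw, std_err):
    """Shrink raw teacher effects towards their mean by estimated reliability."""
    values = list(raw.values())
    grand_mean = mean(values)
    # Average sampling (noise) variance across teachers.
    noise_var = mean(se ** 2 for se in std_err.values())
    # Variance of true effects, bounded below at zero.
    signal_var = max(pvariance(values) - noise_var, 0.0)
    shrunk = {}
    for teacher, estimate in raw.items():
        total = signal_var + std_err[teacher] ** 2
        reliability = signal_var / total if total > 0 else 0.0
        shrunk[teacher] = grand_mean + reliability * (estimate - grand_mean)
    return shrunk


# Hypothetical raw effects; A and D have few students, hence large standard errors.
raw = {"A": 0.6, "B": -0.6, "C": 0.3, "D": -0.3}
std_err = {"A": 0.4, "B": 0.1, "C": 0.1, "D": 0.4}
shrunk = shrink_effects(raw, std_err)
print(shrunk)
```

In this example teachers A and B have raw estimates of the same magnitude, but A's noisier estimate is shrunk much more strongly than B's.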

| Assessing sorting of students
An important concern in the estimation of teacher effects with Equation 1 is whether the systematic sorting of students of different ability into different classrooms might bias these estimates. The Rothstein falsification test shows that estimated teacher effects can predict prior student performance, which teachers cannot possibly have causally affected (Rothstein, 2010). We replicate this finding in Table A2 in the Appendix. Several papers have argued that this test does not in fact falsify teacher value-added estimates. Goldhaber and Chaplin (2015) show theoretically and empirically that the Rothstein test can in fact "falsify" models that are unbiased, and that the sorting of students does not necessarily imply that models are biased. Koedel and Betts (2010) show that including sufficient controls (multiple cohorts of students, and student fixed effects) can remove sorting bias, and that this bias has a relatively small effect on the estimated variation in teacher effects. The Young Lives school survey asks directly how students are assigned to classes; the overwhelming majority of teachers report that they are assigned effectively at random (and not explicitly sorted based on ability). We conduct two additional checks that do show some evidence of sorting, results of which are reported in Table A3. First, we show that, within schools, pre-existing student characteristics are only slightly correlated with teacher characteristics (after controlling for school fixed effects). In particular, students with better prior test scores are slightly more likely to be assigned to teachers with more education, though this effect is not large. Second, we follow the approach of Aaronson et al. (2007) and calculate the average variation (standard deviation) of test scores within classrooms. We then compare the observed variation with the variation that would be produced through perfect sorting based on prior test scores, and through random matching of students and teachers.
The average standard deviation in our data is 0.78, which is closer to what would be found through perfect sorting (0.81) than through perfect random assignment (1.04). Ultimately, as we control for prior test scores we explicitly control for sorting based on these observed characteristics. Interpreting our estimates as causal requires the assumption that there is no further sorting based on unobservable characteristics. If such sorting remains, our estimates would be biased upwards and can hence be interpreted as an upper bound.
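The benchmark comparison in the second check can be illustrated with a small simulation. This is only a sketch with simulated scores, not the Young Lives data: it computes the average within-classroom standard deviation when students are allocated to classes in order of prior score (perfect sorting) and when they are allocated at random. The class size, sample size, and function name are hypothetical choices.

```python
import random
from statistics import mean, pstdev


def avg_within_class_sd(scores, class_size):
    """Average standard deviation of scores across consecutive classes."""
    classes = [scores[i:i + class_size] for i in range(0, len(scores), class_size)]
    return mean(pstdev(c) for c in classes)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
scores = [rng.gauss(0, 1) for _ in range(600)]  # simulated standardized prior scores

# Perfect sorting: fill classes in order of prior score.
sorted_sd = avg_within_class_sd(sorted(scores), 30)

# Random assignment: shuffle students before filling classes.
shuffled = scores[:]
rng.shuffle(shuffled)
random_sd = avg_within_class_sd(shuffled, 30)

print(sorted_sd, random_sd)
```

An observed within-classroom standard deviation can then be placed between these two simulated benchmarks to gauge how much sorting on prior scores is present.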

| Estimating the persistent effects of teachers
Next, we take our estimates of teacher quality with the school survey from Equation 1 and use them to predict future test scores on the household survey. We regress later test scores ($y_{i,t+1}$) on teacher quality estimated from Equation 1 ($\hat{\tau}_j$), earlier student test scores ($y_{i,t-2}$) and covariates ($X_{i,t-2}$), using OLS:
$$y_{i,t+1} = \delta\, \hat{\tau}_j + \lambda\, y_{i,t-2} + \phi X_{i,t-2} + u_{i,t+1} \qquad (2)$$
Standard errors are clustered at the teacher level. The inclusion of school fixed effects in Equation 1 reduces the chance that our results are driven by non-random sorting of teachers and students into different schools. Hence our estimates of teacher quality are based only on within-school variation in teacher quality. This means, however, that we are understating the true variation in teacher quality, which will vary across schools as well as within them.
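To make the role of clustering concrete, the sketch below regresses a later test score on estimated teacher quality and computes a cluster-robust standard error at the teacher level. It is deliberately simplified and is an assumption-laden illustration rather than the paper's specification: it omits the earlier test scores and covariates of Equation 2, uses the basic sandwich formula without a small-sample correction, and runs on hypothetical toy data.

```python
from collections import defaultdict
from statistics import mean


def clustered_slope(x, y, clusters):
    """OLS slope of y on x (with intercept) and its cluster-robust standard error."""
    x_bar, y_bar = mean(x), mean(y)
    xd = [xi - x_bar for xi in x]
    sxx = sum(v * v for v in xd)
    slope = sum(v * (yi - y_bar) for v, yi in zip(xd, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Sum the score contributions within each cluster before squaring,
    # which allows errors to be correlated among a teacher's students.
    score = defaultdict(float)
    for v, e, g in zip(xd, resid, clusters):
        score[g] += v * e
    variance = sum(s * s for s in score.values()) / sxx ** 2
    return slope, variance ** 0.5


# Hypothetical data: four teachers, two students each; teacher VA is constant within teacher.
teacher_va = {"T1": -0.3, "T2": -0.1, "T3": 0.1, "T4": 0.3}
shock = {"T1": 0.02, "T2": -0.02, "T3": -0.02, "T4": 0.02}  # common classroom shock
clusters = [t for t in teacher_va for _ in range(2)]
x = [teacher_va[t] for t in clusters]
y = [0.1 * teacher_va[t] + shock[t] for t in clusters]

slope, se = clustered_slope(x, y, clusters)
print(slope, se)
```

Because the regressor varies only at the teacher level, treating students as independent observations would tend to overstate precision, which is the motivation for clustering at that level.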

| Correlates of good teachers
Finally, we estimate the correlates of teacher effectiveness. We use the estimates of teacher quality ($\hat{\tau}_j$) obtained from Equation 1 as the outcome variable, and teacher characteristics taken from the teacher interview component of the school survey as regressors:
$$\hat{\tau}_j = \beta_1 Z_j + \beta_2\, \text{effort}_j + \beta_3\, \text{knowledge}_j + \beta_4\, \text{skill}_j + \epsilon_j \qquad (3)$$

An alternative approach would have been to look at the effects of teacher characteristics on student learning directly. While this might give similar coefficient estimates, we are also interested in the predictive power ($R^2$) of teacher characteristics on just the component of student learning that is affected by teachers, which is what we are estimating here.

| RESULTS
First, we estimate teacher effects using OLS (Equation 1). The standard deviation of Grade 5 teacher effects is 0.298 for Ethiopia and 0.282 for Vietnam, after applying a Bayesian shrinkage factor and controlling for school fixed effects (see Table 4). These magnitudes are in line with previous findings from similar contexts in India, Ecuador, and Uganda. These estimates are reported in Table A1.
We then present the results of the second-stage OLS regression of later student test scores on earlier teacher effectiveness (Equation 2). In mathematics, a 1 standard deviation (σ) increase in teacher quality results in an overall improvement after 5 years of 0.08σ in the pooled sample, 0.09σ in Vietnam and 0.18σ in Ethiopia. Overall these effects are around a quarter of the size of the immediate 1-year teacher effects. In reading, a 1σ increase in teacher quality results in an improvement after 5 years of 0.06σ in the pooled sample, 0.05σ in Vietnam and 0.11σ in Ethiopia. However, the results for the two sub-samples are not statistically significant (see Tables 5 and 6 for the results by country). We also show the results graphically in Figures 2 and A1. One possible explanation for a greater influence of (previous) teachers in math than in reading is that math is typically mostly learned at school, whereas reading is often learned in a wider range of settings including the home and broader literacy environment. Table 7 reports the results over a period of 2 years and finds no significant effects in either math or reading when including student controls and school fixed effects.
Note: This table shows the distribution of estimated teacher fixed effects (TFEs), first without school fixed effects and then with school fixed effects. The adjusted standard deviation of TFEs uses the procedure outlined in Aaronson et al. (2007) to account for the uncertainty in our estimates of the teacher effects.

| Heterogeneous effects (differential teacher effectiveness)
Here we examine interactions between teacher quality and baseline student characteristics (see Table 8). Students from wealthy households gain more from better teachers in Vietnam. We do not see statistically significant effects in Ethiopia, though this may be due to a smaller sample. This is in line with Glewwe et al. (2017) who find differential effects for wealthy students in Peru (but not Vietnam) over a shorter time period. More disadvantaged pupils may benefit less from teaching quality, for example, owing to a lack of social or cultural capital required to fully access the curriculum or to benefit from the pedagogical approach. This may be particularly the case for linguistic minorities or groups with large cultural differences from a dominant majority. In our results, students from wealthier backgrounds not only perform better on average regardless of which teacher they had, but also benefit more than average from having been previously assigned to a quality teacher (see Table 7 for results after 2 years). Students who are 1σ above average on the household asset index benefit by 0.09σ more from having a quality teacher 5 years earlier than other students. The effect is similar in both mathematics and reading. While it is not straightforward to identify the precise channels through which these effects might operate, linked to the points above, one possibility is that wealthier households are able to provide better ongoing support to education which may help to "sustain" the benefits of good teaching over the longer term. Another possibility might relate to forms of discrimination, whether intended or not, against disadvantaged students by teachers and schools.
Looking at gender, there is no differential effect of having a high-quality teacher for boys or girls. This finding is consistent with the general pattern of high levels of gender equity in educational achievement and progress in Vietnam. In Ethiopia, while gender parity has not been reached, gender gaps are narrowing substantially. Students who are from an ethnic minority in Vietnam perform substantially worse on average, but their reading benefits more than average from having previously been assigned a quality teacher, both 2 years (see Table 7) and 5 years earlier. Ethnic minority students, unlike majority Kinh, often do not speak Vietnamese at home, so the importance of school and teaching in their learning of Vietnamese may be expected to be greater, which is consistent with this finding. For Ethiopia there is not a clear ethnic minority group. In Amhara and Addis Ababa over 95% of our sample speak Amharic, and 100% speak Tigrinya in our Tigray sample. There is an Amharic-speaking minority in our Oromiya sample, and three minority language groups in the SNNP, but the sample of these groups is small.
We examine possible effects of effective teachers on a number of non-cognitive measures collected in the Young Lives surveys, reported in Tables 9 and 10. No significant teacher effects on these outcomes are detected.

FIGURE 2 Distribution of test scores 5 years after Grade 5 teacher assignment. This figure presents the distribution of student mathematics test scores five years after Grade 5, for students with a bottom quartile Grade 5 teacher, and those with a top quartile Grade 5 teacher

TABLE 7 Effects of teacher quality on test scores after 2 years
Note: This table presents a regression of 2013 student test scores on their teacher quality (value-added, VA) from 1-2 years prior. The dependent variable in each model is the z-score of the percentage of items correct on that test in the Round 4 survey. Teacher VA is estimated using the school survey data, and refers to the estimated effect of the teacher on class test scores at that time (i.e. 1-2 years before the Round 4 test). Student controls in this regression include prior math, reading, and Peabody Picture Vocabulary Test scores (prior to exposure to the teacher in question), prior household wealth, sex, and ethnic group.

Note: This table presents a regression of 2017 student test scores on their teacher quality (value-added, VA) from 5 years prior. The dependent variable in each model is the z-score of the percentage of items correct on that test in the Round 5 survey. Teacher VA is estimated using the school survey data, and refers to the estimated effect of the teacher on class test scores at that time (i.e. 5 years before the Round 5 test). Student controls include sex, ethnic group, asset index, and whether they board at school. *p < .1; **p < .05; ***p < .01.

| Correlates of teacher quality
We proceed to collapse the data to the teacher level in order to examine which teacher and classroom characteristics correlate with teacher effectiveness. Here we see that few characteristics of the teachers themselves are strongly correlated with performance (see Table 11). This is consistent with the literature in general, which finds that observed characteristics of teachers typically have weak explanatory power in educational production function studies (see Glewwe et al., 2020). We include teacher age, an asset index, years of experience, higher education (a postgraduate degree in Vietnam and a post-secondary diploma or higher in Ethiopia), whether teachers have a math specialization, a teacher self-efficacy index, and whether they are on a permanent contract. The classroom-average student asset index is correlated with teacher performance if we do not control for school fixed effects, but not if we do (indicating that there is sorting of wealthier students to good schools, but not to good teachers within schools).
Note: This table presents a regression of later student aspirations for further education and expected income on their teacher quality (value-added, VA) from Grade 5. Teacher VA is estimated using the school survey data, and refers to the estimated effect of the teacher on class test scores at that time. Student controls in this regression include prior math, reading, and Peabody Picture Vocabulary Test scores (prior to exposure to the teacher in question), prior household wealth, sex, and ethnic group.

| CONCLUSION
This paper has shown that the impacts of effective teachers are persistent. Having a "better" Grade 5 teacher results in better test scores in Grade 10, after students have graduated to middle school. Effects are larger in mathematics than in reading and are not significant for non-cognitive outcomes such as grit, aspirations, and subjective well-being. Measured teacher characteristics bear little relation to estimated teacher quality. How much is this worth in economic terms? Chetty et al. (2014b) estimate the value of an effective teacher (specifically of replacing a teacher in the bottom 5% of the value-added distribution with one of average quality in value-added terms) in the USA to be around $250,000 per classroom over a teacher's career. This is based on the increase in the group of students' lifetime incomes associated with test score improvements which are expected from more effective teaching in this illustration. Estimates of the rate of return to higher skills in Vietnam suggest that a 1 standard deviation higher reading test score is associated with 15% higher earnings overall, and 6% after controlling for schooling (Valerio, Puerta, Laura, Monroy Taborda, & Tognatta, 2016). Our results indicate that having a 1 standard deviation better teacher is associated with persistent 0.1 standard deviation better test scores, which should therefore equate to 0.6% higher earnings. Gross domestic product per capita in Vietnam is around $6,000, so a 0.6% increase is worth $36 per year per student. Each class of 20 students taught by a better teacher might therefore be expected to earn a cumulative total of $720 in addition per year. The benefits potentially go far beyond those directly evaluable in economic terms, of course.
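The back-of-the-envelope earnings calculation in the conclusion can be reproduced step by step. The sketch below simply restates the figures quoted in the text (a 6% return per standard deviation after controlling for schooling, a persistent effect of 0.1 standard deviations, GDP per capita of around $6,000, and a class of 20 students); the variable names are illustrative only.

```python
# Inputs quoted in the text.
return_per_sd = 0.06          # earnings return to 1 SD higher test score, net of schooling
persistent_effect_sd = 0.1    # persistent test score gain from a 1 SD better teacher
gdp_per_capita = 6000         # approximate GDP per capita in Vietnam, in dollars
class_size = 20               # students per class in the illustration

# 0.1 SD at a 6% return per SD implies a 0.6% earnings gain.
earnings_gain_share = return_per_sd * persistent_effect_sd

# Applied to GDP per capita as a rough proxy for annual earnings.
gain_per_student = earnings_gain_share * gdp_per_capita
gain_per_class = gain_per_student * class_size

print(round(earnings_gain_share * 100, 2), round(gain_per_student, 2), round(gain_per_class, 2))
```

Using GDP per capita as a stand-in for individual earnings is itself an approximation, so the resulting figures are best read as orders of magnitude.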
Improved educational outcomes are associated with a very wide range of social benefits from reduced fertility to improved civic participation. Potential policy implications are numerous, although it is not possible to "read off" implications given the uncertainty about the mechanisms involved. Nonetheless, evidence on wide variation in teacher effectiveness and on the long-lasting impacts of teacher effectiveness points towards the need for careful attention to effectiveness concerns when recruiting, training, deploying, rewarding, promoting, and managing teachers. Policies which improve teacher effectiveness at scale have the potential to bring extensive benefits, although these are likely to be somewhat context dependent.
One possible policy implication is that the importance of within-school variation in teacher quality raises questions about the potential for parental choice to lead to school improvement. Even if parents choose schools, they do not choose teachers. More broadly, our results re-emphasize a neglected point about important and consequential differences in teacher effectiveness that are not well captured by a teacher's level of qualification. The results therefore have implications for how governments should think about recruitment and performance management. However, further details on such policies are beyond the scope of this paper.
Note: In this table teacher characteristics (experience, education, socioeconomic status (SES), sex, and contract status) are regressed on prior student characteristics (sex, age, SES, meals, and lagged test scores). This shows some evidence for non-random matching of students with teachers, as students with better prior test scores are more likely to be matched with teachers with higher levels of education.