The achievement gap: The impact of between‐class attainment grouping on pupil attainment and educational equity over time

Despite extensive research on attainment grouping, the impact of attainment grouping on pupil attainment remains poorly understood and contested. This paper presents evidence from a study conducted with 2944 12–13 year olds, from 76 schools in England, who were allocated to between-class attainment groups (‘setting’) in English and mathematics over the first 2 years of secondary schooling. After controlling for prior attainment, pupils in the top set performed significantly better than pupils in the middle and bottom sets in both English and mathematics. The findings indicate a widening gap in attainment, especially in the case of English. Findings, especially in the case of mathematics, provide more evidence of a relative benefit for pupils placed in top sets than a relative detriment for those in bottom sets.


INTRODUCTION
Few topics in education have generated such controversy or longstanding study as grouping by 'ability' ('tracking'). In spite of what Steenbergen-Hu et al. (2016) characterise as a century of research on this topic, the impact of grouping by prior attainment, and especially of different methods of grouping, remains contested. This can to some extent be explained by the problematic nature of the existing research literature. There is a scarcity of contemporary work focused on pupil-level outcomes, and different types of grouping are often conflated within the meta-analyses and syntheses that have predominated in the field (Francis et al., 2020), making it difficult to draw clear conclusions. Many policymakers and practitioners believe attainment grouping to be effective (see Francis et al., 2017), and practices of ability grouping (tracking), including between-school grouping, between-class grouping (e.g. setting) and within-class grouping, are prevalent in many systems internationally (Jerrim, 2019; OECD, 2016). In England, grouping by attainment is widespread in both primary and secondary schools, with 37% of 6-7 year olds placed in attainment groups for either literacy or numeracy (Hallam & Parsons, 2012), and more than 70% of secondary schools placing 11 year olds in attainment groups for mathematics (Taylor et al., 2020).
This paper provides timely new empirical evidence on the impact of the contentious practice of grouping by attainment on pupils' attainment outcomes. Specifically, we use data from an experimental study of well-defined practices to examine the different effects on those placed in top and bottom sets. We provide up-to-date evidence about the impact of setting in the United Kingdom. This is important because very few large-scale studies have been carried out in the United Kingdom and because the practice of setting is very different to that of the United States, where the bulk of the research has been conducted. We highlight issues raised both for social justice in teaching practice and for future research.

Contrasting theories about attainment grouping
Proponents of attainment grouping argue that placing pupils in more homogeneous classes enables teachers to better tailor the curriculum and pedagogy to those pupils, and, hence, is more efficient and effective for all pupils (e.g. Hallinan, 1994; Rosenbaum, 1999). Many school leaders believe that within-school grouping has benefits for all pupils, including those with low prior attainment. Indeed, a study conducted in England by Macleod et al. (2015) found that more than a third of schools surveyed had 'introduced or improved' setting as a way of raising attainment for disadvantaged pupils.
On the other hand, critics of grouping by attainment point to analyses showing the practice may not be as efficient as hypothesised by its proponents, and may in fact increase educational inequity. For example, Hanushek and Wößmann's (2006) analysis of international comparative tests in mathematics, reading and science suggests that educational systems that adopt early grouping by attainment tend to have a widening gap in attainment over time, thus increasing educational inequity, and may additionally be associated with an overall decrease in mean attainment in comparison to other systems. Similarly, the OECD's analysis of PISA 2012 suggests a relationship between grouping by attainment within schools and the share of low and top performers in an education system, concluding that 'more ability grouping within schools is related to a greater number of low performers in mathematics, and fewer top performers' (OECD, 2016, p. 186). Evidence from many observational studies in the United States, Germany and the Netherlands, as well as the United Kingdom, suggests that ability grouping is associated with increased inequity on educational outcomes (e.g. Berends & Donaldson, 2016; Borghans et al., 2020; Capsada-Munsech & Boliver, 2019; Gamoran & Mare, 1989; Matthewes, 2021), although there is some dispute as to whether this widening gap reflects a benefit for those pupils placed in high sets, a disbenefit for those placed in low sets, or both (e.g. Betts & Shkolnik, 2000).
Longstanding research demonstrates that pupils from low socio-economic groups, and from certain minority ethnic (typically Black) backgrounds, are disproportionately likely to be found in low-attainment tracks and groups, whereas White pupils from affluent families are over-represented in high-attainment groups and 'academic' tracks (e.g. Bosworth, 2013; Moller & Stearns, 2012; Muijs & Dunne, 2010; Strand, 2012). Recent research in England bears this out (see Archer et al., 2018; Connolly et al., 2019). Research has also shown that pupils from socially disadvantaged backgrounds are disproportionately misallocated to low-attainment groups (Dunne et al., 2011; Jackson, 1968), compounding existing inequalities at the start of schooling (e.g. Waldfogel & Washbrook, 2010). This over-representation of pupils from disadvantaged backgrounds in low sets (and more affluent pupils in high sets), coupled with ongoing attainment gaps and the hypothesis emerging in numerous studies that attainment grouping practices cause inequitable educational progress, has led attainment grouping to be frequently seen as a matter for social (in)justice in education.
Within this body of work, it is argued that the inequitable outcomes for different attainment groups may be due to several factors such as differences in teacher expectations, teacher quality, curriculum content and opportunity to learn, as well as pupils' self-confidence and motivation (e.g. Francis et al., 2017; Oakes, 1995). Research does indicate that these arguments are to some extent justified. Teachers' expectations do appear to be lower for those pupils placed in lower-attaining groups (Campbell, 2014, 2017; Ireson & Hallam, 2009; Timmermans et al., 2015). There is evidence that lower sets tend to be allocated teachers with less subject-specific expertise or less experience (Francis et al., 2019; Kelly, 2004; Papay & Kraft, 2015). Lower-attaining groups do appear to be taught a reduced curriculum offer (Hallam & Ireson, 2005; Jaremus et al., 2020; Wilkinson et al., 2020), offered fewer opportunities for participation and discussion (Gamoran et al., 1995) or conceptual understanding (Martinková et al., 2020), and have restricted opportunities to progress (Buttaro & Catsambis, 2019), while studies have also found a relationship between pupil self-confidence and attainment grouping (Francis, Craig et al., 2020; Houtte et al., 2012; Ireson & Hallam, 2009; Muijs & Dunne, 2010).

Between-class attainment grouping ('setting') and pupil achievement
Given these contrasting theories, attainment grouping practices remain a strong point of interest and contestation within educational practice and research. In this paper, we focus on setting, a particular form of between-class grouping, which is prevalent in English secondary schools (Taylor et al., 2020). Setting is where pupils are grouped by subject attainment for teaching in that subject, and is to be distinguished in England from streaming, where pupils are grouped by general ability for teaching across a majority of subjects (Ireson & Hallam, 2001). There are also many other different forms of grouping by attainment described in the literature, including between-school grouping (tracking), within-class grouping and acceleration for high-attaining pupils (see Francis, Taylor et al., 2020 for elaboration).
We focus on between-class attainment grouping, or setting, for two reasons. First, research syntheses suggest that the various forms of attainment grouping may have statistically significantly different sizes, and even directions, of overall effect (Higgins et al., 2018; Steenbergen-Hu et al., 2016), and that particular grouping practices may impact on different groups of pupils in different ways (Rui, 2009). Second, between-class grouping is widely used in educational systems internationally (Jerrim, 2019).
The impact of attainment grouping on pupils has been the subject of extensive research, and a large number of literature reviews and meta-analyses synthesise the findings on the topic. These syntheses suggest that between-class attainment grouping has no overall benefit to academic attainment, with a small negative impact for low-attaining pupils and a small positive benefit for high-attaining pupils (Higgins et al., 2018; Rui, 2009; Slavin, 1990). On closer examination, however, this extensive evidence base is not as robust or as generalisable as it might first appear.
Research specifically examining between-class grouping at secondary level in the United Kingdom is mostly from small-scale studies (Boaler, 1997; Ireson & Hallam, 2001; Wiliam & Bartholomew, 2004). Notable exceptions are studies by Kerckhoff and by Ireson. Kerckhoff (1986) drew on British birth cohort data to analyse the impact of within-school attainment grouping on the achievement of pupils who attended secondary schools in the 1970s, at a point when the educational system was very different to today and particularly so for low-attaining pupils (Hodgen et al., 2022). The findings indicated a widening attainment gap for schools that used attainment grouping compared to those that did not. In the only other large-scale study carried out in the United Kingdom, Ireson and colleagues examined the effects of between-class grouping in comparison to mixed attainment on the achievement of a cohort of pupils from 45 schools who took attainment tests at age 14 in 2000 (Ireson et al., 2002, 2005) and national GCSE examinations at age 16 in 2002 (Ireson et al., 2005), focusing on three subjects: English, mathematics and science. However, their results were mixed and inconclusive. For example, at age 16, they found no effect for setting, although pupils of equivalent prior attainment performed better in all three subjects when placed in higher sets. Ireson et al.'s (2005) study reports data that are now more than 20 years old and, aside from the need to replicate individual studies (Makel & Plucker, 2014), there is a need to provide up-to-date evidence about the effects of setting.
There are a large number of meta-analyses examining international evidence on the topic (e.g. Kulik & Kulik, 1992; Lou et al., 1996; Slavin, 1990). However, in synthesising different sets of studies, these meta-analyses report effects of attainment grouping that vary from d = −0.45 (Slavin, 1987) to d = 0.19 (Kulik & Kulik, 1984). In an attempt to produce a definitive answer on the issue, Steenbergen-Hu et al. (2016) conducted a secondary meta-analysis in order to review and synthesise the large number of primary meta-analyses on the topic of attainment grouping. They identified no fewer than 11 primary meta-analyses that examined the effects of between-class grouping by attainment and found no statistically significant effect for the practice, either overall or for pupils of high, middle or low attainment. However, these 11 primary meta-analyses were all based on dated original studies; the most recent being published in 1991 and most carried out in the 1960s and 1970s, at a time when statistical methods were much less sophisticated than those currently available, and did not take account of clustering of pupils within classes through approaches such as multilevel modelling (Connolly et al., 2017; Hedges, 2007). Moreover, this was a period when the reporting requirements for experimental studies were relatively weak, since this predated initiatives to pre-register trials and experiments (Styles & Torgerson, 2018). It is likely that, for many studies, attainment grouping was combined with guidance on practice, professional development and/or curriculum adaptation to match different attainment levels. Indeed, it may be that the structural effects of between-class grouping are mediated through teaching quality and opportunity to learn. But the contribution of these elements was not considered in any of the primary meta-analyses through now standard techniques such as moderator analysis or meta-regression. This may be because few of the original studies provide any details on these aspects. Educational practices (and even teaching qualifications) were also very different from the present day. In England, for example, compulsory schooling ended at age 15 until 1972 and many pupils left education without formal qualifications (Gillard, 2018).
The vast majority of these original studies were carried out in the United States and, indeed, the debate around grouping by attainment has largely been framed in terms of the US practices of 'tracking' versus 'detracking' (see e.g. Loveless, 1999), practices that have been treated as synonymous with the practices of setting versus mixed 'ability' teaching in England (Abraham, 2008; Wilkinson & Penney, 2014). In fact, as Domina et al. (2019) show, tracking involves a range of sorting practices that reflect particularities of the American educational system, and tracking, as practised in the United States, is often closer to 'streaming' rather than 'setting' (Wilkinson & Penney, 2014). Hence, the findings of US studies may not generalise to different educational systems and contexts such as England.
Steenbergen-Hu et al. (2016) also conducted a primary meta-analysis restricted to what they judged to be the 'highest quality' original studies: randomised controlled trials (RCTs) for which the full text was available. In contrast to the secondary meta-analysis, this found a positive effect for between-class grouping (g = 0.15, 95% CI: 0.01-0.29). However, this result was based on just five dated studies, published between 1962 and 1974, all of which were conducted in the United States. In addition, all five were small-scale interventions, with four of the five studies each conducted in just one school and the fifth in just four schools, and none of the studies used methods that took account of the clustering of pupils within classes (or schools).
The most recent primary meta-analysis (Rui, 2009), which is not included amongst those synthesised by Steenbergen-Hu et al., found attainment grouping had a negative impact on low-attaining pupils, but no effect on middle or high-attainment pupils. Rui's meta-analysis synthesised the results of just 15 studies, all conducted in the United States. Unfortunately, Rui's analysis aggregates the results of both experimental and observational studies, including just four RCTs published between 1972 and 1996. Furthermore, although Higgins et al.'s (2018) secondary analysis suggests that the effects of between-class and within-class grouping are in different directions, Rui does not distinguish between these two forms of grouping, thus conflating their effects.
None of the above take into account additional factors, such as curriculum and quality of teaching, that might influence the impact of attainment grouping. Only very recently have researchers started to carry out quantitative studies of teaching in relation to attainment grouping and pupil outcomes. Magableh and Abdullah (2021) conducted a small-scale experimental study of differentiated instruction in mixed-attainment classes, finding that differentiated teaching resulted in higher outcomes for pupils. Wang et al. (2021) explored the impact of teacher support on outcomes for pupils tracked into three different school bands in Hong Kong. They found that teacher support mediated the higher English and mathematics attainment of pupils in high-band schools and also moderated the English attainment of pupils in low-band schools. However, the context of these two studies is different, focusing on within-class differentiation and between-school tracking, respectively. Furthermore, 'teacher support' differs from 'teaching quality' and perhaps is more analogous to a supportive climate, or high expectations. No quantitative studies yet focus explicitly on the quality of teaching in schools using between-class grouping.
In summary, the limitations highlighted above indicate a need for robust, contemporary studies of specific between-class grouping practices and their outcomes that establish or contest the somewhat fragile conclusions described above. In particular, there is a need to provide up-to-date evidence and to investigate the effect of between-class grouping, or setting, as it is practised in systems like England. Our analysis seeks to do this, exploring the relative attainment outcomes of pupils placed in different attainment sets (between-class attainment groups) over the first 2 years of secondary school in the core subjects of English and mathematics, using a large, robust and representative sample of schools in England.

METHOD
This paper draws on data from a large-scale mixed-methods project, 'Best Practice in Setting', funded by the Education Endowment Foundation. Specifically, it analyses data collected during a cluster RCT of the 'Best Practice in Setting' intervention. As already noted, 'setting' is an especially prevalent form of between-class attainment grouping in England (Taylor et al., 2020), comprising tracking by subject. In principle, a pupil might be placed in a high set for several curriculum subjects and in low sets for others, depending on their respective prior attainment in disparate subjects. In practice, setting is sometimes mixed with, or layered upon, other tracking practices such as streaming (see Francis, Taylor et al., 2020 for a discussion). The project sought to address prior gaps in the literature, by exploring: whether practice in setting that remediates some of the problematic practices identified in the literature as affecting those in low groups might improve young people's progress; what comprises good practice in mixed attainment pedagogy; and the experiences and outcomes of pupils subject to attainment and mixed-attainment grouping. It consisted of a 2-year intervention comprising guidance for schools on how to group pupils and allocate teachers to classes, and professional development focusing on high expectations for all pupils and flexible conceptions of 'ability' (Roy et al., 2018). The intervention was tested by a fully powered RCT examining the impact or otherwise of practice in setting pupils for English and mathematics in Year 7 and Year 8 based on research evidence. In addition, there were surveys of 13,462 pupils and 597 teachers, and individual and focus group interviews with 246 pupils and 54 teachers, although results from these data are not reported in this paper.
The intervention and research were undertaken in 126 secondary schools in England (divided into intervention or control groups), and involved instigating work with and monitoring pupil cohorts from the beginning of Year 7 (11-12 years old) to the end of Year 8 (12-13 years old), the first 2 years of English secondary schooling. The study focused on their experiences and outcomes in English and mathematics, which were selected as the foci because: (a) they are two subjects given longstanding priority in the national curriculum and within-school performance indicators; and (b) they represent diversity in content and pedagogy.
The trial was conducted by an independent evaluation team who were responsible for the trial design, school recruitment, randomisation, pre-specification and registration of the trial (Roy & Styles, 2017), as well as the administration and marking of the primary outcome attainment tests. The intervention was developed and delivered by the programme delivery team, including the authors of this paper, who were also involved in the school recruitment, quantitative and qualitative data collection, and supporting relationships with schools throughout the trial. The study was approved by the Research Ethics Committee of King's College London and Queen's University Belfast and, later, UCL Institute of Education. This paper analyses the differential impact of setting on attainment outcomes for pupils placed in different set levels across the 2 years of the intervention. To be clear, where the RCT compared attainment outcomes between the intervention and schools maintaining 'business as usual' setting, this paper explores the impact of setting per se on the outcomes of pupils in different attainment groups.

The sample
To be eligible for the trial, schools had to use subject-based between-class grouping by attainment (not streaming) and to have at least three set levels for each subject (top, middle and bottom). Schools were recruited to the 'Best Practice in Setting' trial through a mixture of volunteer and direct 'cold call' approach sampling, then randomised to the intervention and control groups of the RCT. Volunteer-sampled schools were recruited through a traditional and social media campaign by the authors. Direct approach-sampled schools were identified through a stratified random sample then approached by the independent evaluation team (see Roy et al., 2018). Of these 126 schools, 121 took part in the mathematics trial and 79 took part in the English trial. However, there was considerable dropout of participant schools over the course of the 2-year trial, and a significant proportion of schools did not deliver the final outcome tests in English and mathematics. Hence, the achieved sample consisted of 73 schools in mathematics and 35 schools in English. Since some schools took part in both subjects for the trial, there was a total of 76 schools in the achieved sample.
The overall characteristics of the pupils and schools in the mathematics and English samples are summarised in Table 1. Demographic data in relation to gender, household background, free school meals entitlement, ethnicity and set allocation are provided for the 2236 pupils in the mathematics trial and 919 pupils in the English trial. The samples are reasonably reflective of the national population. In particular, it can be seen that the sample is well balanced in terms of gender and also broadly representative of the national population in relation to ethnicity [where it is reported that, nationally, 76% are White, 10% Asian, 6% Black and 5% mixed; see DfE (2015, p. 15)]. The sample is also broadly representative in relation to the proportion of disadvantaged pupils, with 30.6% of the present sample having been eligible for free school meals (FSM) at some point, compared to the nationally reported figure of 32% (DfE, 2015, p. 14). It is also noteworthy that there is a large amount of missing data on ethnicity and household socio-economic status (SES). This is because a large proportion of pupils chose not to provide these data: in the mathematics trial, 35% and 42% did not provide data on ethnicity or SES, respectively, and, in the English trial, 38% and 45% did not provide data on ethnicity or SES, respectively. The two samples in each subject, with and without attrition, were broadly similar. See Table S1 for further details. We address the issue of missing data further in the analysis section, below.
The overall characteristics of the schools are broadly reflective of the national population of state-funded, non-selective schools (see Connolly et al., 2019). The proportions of OFSTED grades across schools are generally representative of the national picture (in 2015) of 22% outstanding, 56% good, 17% requires improvement and 5% inadequate (OFSTED, 2016, p. 133), although we note that the English sample is slightly skewed towards poorer performing schools.

This sample of schools was recruited for the purpose of the trial and, as such, had expressed some interest in adopting 'best practice' in attainment grouping. Hence, whilst not fully representative, the sample may be considered a 'telling case' in that these schools might be expected, if anything, to be more interested than other schools in increasing equity across attainment groups (Mitchell, 1984).
TABLE 1 Sample characteristics
For the purposes of this analysis and for comparability with previous analyses (Connolly et al., 2019; Francis, Craig et al., 2020), we have combined the intervention and control group schools, which is justified because no significant effect was found for the intervention for either subject (see Roy et al., 2018).

Outcome measures
At post-test, attainment was measured using the paper versions of the Progress in English (PTE13) and Progress in Mathematics (PTM13) tests, which are standardised tests produced and validated by GL Assessment (2015a, 2015b). The independent evaluation team conducted the post-tests. They drew a random sample of 30 pupils in each school participating in the mathematics trial to complete the outcome test in mathematics and a random sample of 30 pupils in each school participating in the English trial to complete the outcome test in English.

Pre-test measures
Pupils' Key Stage 2 (KS2) national assessment results for mathematics and English (DfE, 2015) were used for pre-test measures of attainment, and were collected at the beginning of the school year in September 2015 through the National Pupil Database, as the pupils began Year 7. Full decimalised KS2 'fine points' scores (rather than simply levels) were used. Outcome attainment was measured at the end of the following academic year as pupils completed Year 8, after two intervening years of schooling, in June 2017.

Household socio-economic status
Household socio-economic status data were collected via questions on a pupil survey concerning parental/carer occupation, with categorisation according to the highest-status occupation between parents. Following this analysis (and given longstanding difficulties in judging the nature and content of some occupations), the tiered occupations were further categorised into three categories, higher, intermediate and lower, corresponding to the ONS three-class model (ONS, n.d.).

Set level
Schools in our sample varied in relation to the number of set levels they applied, from two to ten, with most falling between three and five (intervention schools in the setting trial had been specifically asked to cap the set level number at four). For the purposes of this current analysis, pupils were coded into three groups for English and mathematics, respectively, in each school: those in the very top set; those in the middle set(s); and those in the very bottom set. Thus, for a school with four sets, the top set was coded '1', the middle two sets coded '2' and the bottom set coded '3'. Similarly, for a school with five sets, the top set was coded '1', the middle three coded '2' and the bottom set coded '3'. The breakdowns of the sample by these three categories for English and mathematics are also shown in Table 1.
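The coding rule described above can be sketched in a few lines of Python. This is an illustration only, not the project's own code: the function name `code_set_level` and the convention that set ranks run from 1 (highest set) to `n_sets` (lowest set) are assumptions made for the example.

```python
def code_set_level(set_rank: int, n_sets: int) -> int:
    """Collapse a school's set rank into 1 = top, 2 = middle, 3 = bottom.

    Hypothetical helper: assumes set_rank runs from 1 (highest set)
    to n_sets (lowest set) within a school and subject.
    """
    if n_sets < 2 or not 1 <= set_rank <= n_sets:
        raise ValueError("set_rank must lie between 1 and n_sets, with n_sets >= 2")
    if set_rank == 1:
        return 1  # very top set
    if set_rank == n_sets:
        return 3  # very bottom set
    return 2      # any set in between counts as 'middle'


# The two worked cases from the text: schools with four and five sets.
four_sets = [code_set_level(rank, 4) for rank in range(1, 5)]
five_sets = [code_set_level(rank, 5) for rank in range(1, 6)]
print(four_sets)  # [1, 2, 2, 3]
print(five_sets)  # [1, 2, 2, 2, 3]
```

Under this rule a school with only two sets contributes pupils to the top and bottom categories but none to the middle category.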

Analysis
The data were analysed in Stata 17.0 (StataCorp, 2021) by fitting a series of three multilevel models in each subject, mathematics and English, with pupils (level 1) clustered within individual subject sets (level 2) and then within schools (level 3). In each model, dummy variables representing the three categories of set level (top, middle and bottom) were included, along with other covariates representing pre-test attainment (KS2 in mathematics and English, respectively), gender, allocation to the intervention and total number of sets within the school. The principal model for each subject, M1, also included household occupation (SES) and ethnicity as covariates. However, as already noted, there was a large amount of missing data in these two variables. To investigate the effect of these missing data, we used two approaches. First, we ran two further models in each subject, M2 and M3, to assess the sensitivity of the results of the primary model, M1. Model M2 excluded household occupation (SES) and ethnicity as covariates and was based on the entire sample of pupils. Model M3 also excluded household occupation (SES) and ethnicity as covariates, but was based on the samples of pupils with complete data (i.e. the same dataset as for M1). Second, we used multiple imputation to impute the missing data for household occupation (SES) and ethnicity, then re-ran the principal model on the imputed dataset to compare this with the complete case analysis, M1. Our assumption is that robust, practically significant effects would not be sensitive to changes in the modelling.
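For readers who prefer a formula, the three-level structure described above could be written roughly as follows. This is a sketch under stated assumptions rather than the authors' exact specification: the symbols are introduced here for illustration, random intercepts at set and school level are assumed, and the middle set(s) are taken as the reference category for the set-level dummies.

```latex
\begin{align*}
y_{ijk} &= \beta_0 + \beta_1\,\mathrm{Top}_{jk} + \beta_2\,\mathrm{Bottom}_{jk}
          + \beta_3\,\mathrm{KS2}_{ijk} + \beta_4\,\mathrm{Gender}_{ijk} \\
        &\quad + \beta_5\,\mathrm{Intervention}_{k} + \beta_6\,\mathrm{NSets}_{k}
          + v_k + u_{jk} + e_{ijk}, \\
v_k &\sim N(0, \sigma_v^2), \qquad u_{jk} \sim N(0, \sigma_u^2), \qquad e_{ijk} \sim N(0, \sigma_e^2).
\end{align*}
```

Here i indexes pupils, j subject sets and k schools. In this notation, M1 would add dummy variables for household occupation (SES) and ethnicity to the fixed part, while M2 and M3 would use the equation as written, fitted to the entire sample and to the complete-case sample, respectively.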
The models were then used to estimate the adjusted mean attainment scores for pupils in the three set levels, controlling for these covariates. In practice, this was done by substituting a series of values into the fitted model. These values consisted of either: the relevant values of the dummy variables for the set levels (i.e. either '0' or '1'); or the mean scores for each of the other covariates included in the model; or '1' for the constant. The standard deviations for each of the mean scores estimated were calculated using the standard error of the associated null model multiplied by the square root of the sample size to account for the clustered nature of the data, and the size of each subsample represented the total number in each category for whom there were full data (and thus whose data were included in the model).
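The substitution step described above amounts to a weighted sum of the fitted coefficients, and the standard deviation step to SD = SE × √n. The Python sketch below illustrates both. The coefficient values, the covariate mean and the helper names `adjusted_mean` and `sd_from_se` are hypothetical, chosen only to make the arithmetic easy to follow; they are not estimates from the study.

```python
import math


def adjusted_mean(coefs: dict, values: dict) -> float:
    """Sum coefficient * substituted value over every term in the model."""
    return sum(coefs[name] * values[name] for name in coefs)


def sd_from_se(se: float, n: int) -> float:
    """Recover a standard deviation from a standard error: SD = SE * sqrt(n)."""
    return se * math.sqrt(n)


# Hypothetical fixed-effect estimates (NOT the paper's results);
# the middle set is the reference category, so it has no dummy.
coefs = {"const": 10.0, "top": 2.0, "bottom": -1.0, "ks2": 5.0}
ks2_mean = 4.0  # hypothetical sample mean of the pre-test covariate

# '1' for the constant, 0/1 for the set dummies, the mean for the covariate.
top = adjusted_mean(coefs, {"const": 1, "top": 1, "bottom": 0, "ks2": ks2_mean})
middle = adjusted_mean(coefs, {"const": 1, "top": 0, "bottom": 0, "ks2": ks2_mean})
bottom = adjusted_mean(coefs, {"const": 1, "top": 0, "bottom": 1, "ks2": ks2_mean})

print(top, middle, bottom)   # 32.0 30.0 29.0
print(sd_from_se(0.5, 400))  # 10.0
```

Because every covariate other than the set dummies is held at its mean, the differences between the three adjusted means equal the set-level coefficients themselves.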
Standardised effect sizes were calculated using Hedges' g. To account for the effects of clustering, 95% confidence intervals were calculated using the standard errors of the regression coefficient and transformed into an effect size to produce the upper and lower bounds of the effect size from the model.
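One way to read this procedure is sketched below in Python. It is an assumption-laden illustration, not the authors' code: it assumes the regression coefficient for a set-level dummy is divided by a pooled standard deviation, multiplied by the usual small-sample correction for Hedges' g, and that the interval bounds (coefficient ± 1.96 standard errors) are rescaled in the same way. All function names and input values are hypothetical.

```python
import math


def hedges_correction(n1: int, n2: int) -> float:
    """Small-sample correction factor J = 1 - 3 / (4 * (n1 + n2) - 9)."""
    return 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)


def pooled_sd(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Pooled standard deviation of two groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def effect_size_with_ci(coef, se, sd1, n1, sd2, n2, z=1.96):
    """Rescale a coefficient and its 95% bounds into Hedges' g units.

    Returns (g, lower, upper). The bounds come from the model's standard
    error, so they inherit whatever clustering adjustment the model made.
    """
    scale = hedges_correction(n1, n2) / pooled_sd(sd1, n1, sd2, n2)
    return coef * scale, (coef - z * se) * scale, (coef + z * se) * scale


# Hypothetical inputs (NOT the paper's estimates): a top-versus-middle
# coefficient of 2.0 score points with a standard error of 0.5.
g, lower, upper = effect_size_with_ci(2.0, 0.5, sd1=10.0, n1=100, sd2=10.0, n2=100)
print(round(g, 3), round(lower, 3), round(upper, 3))
```

With large subsamples the correction factor is very close to 1, so g is almost identical to the coefficient divided by the pooled standard deviation.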

RESULTS
A summary of the results for the main models, M1, M2 and M3, showing the effects on pupil attainment after experiencing setting for two school years, from the hierarchical regression models, is shown in Table 2, for English, and Table 3, for mathematics. A summary of the results of the imputation models is provided in Table S1.
The findings show that after 2 years, there was a statistically significant increase in the attainment level for pupils in the top set when compared to the middle set(s) in both subjects, and this effect was robust across all three models and also for M1 on the imputed dataset. However, the effect is much larger for English than for mathematics, where the effect of prior attainment at KS2 is comparatively much larger. The finding of lower attainment for pupils placed in the bottom set for English when compared to those in the middle set was not of a consistent size across the models: it was statistically significant in two of the models, M1 and M3, but not in M2, which is based on the entire dataset including those pupils with missing SES and ethnicity data, nor in M1 on the imputed dataset. Hence, whilst the attainment of those placed in the bottom set for English is lower than those in the middle set, this effect was not robust across all models and the significant results for models M1 and M3 may have been subject to bias due to missing data. The attainment of pupils in the bottom set for mathematics was lower after 2 years compared to the middle set, although this effect was relatively small and not statistically significant in any of the models, and the imputation analysis showed an effect very close to zero. Hence, despite some negative trends, we found no evidence to indicate that the attainment of those in the bottom set decreased significantly relative to similar pupils placed in the middle set. The effect sizes for attainment of pupils in the top and bottom sets compared to the middle set for both subjects and for all three models are summarised in Tables 4 and 5 and illustrated graphically in Figures 1 and 2.
It can be seen from Table 4 and Figure 1 that, in mathematics, the relative increase for pupils placed in the top set compared to those in the middle set, after controlling for prior attainment, is consistent across the three models at g = 0.1. Table 4 and Figure 2 show that, in English, the relative increase for pupils placed in the top set compared to those in the middle set is also consistent across the models, but is almost three times as large at around g = 0.27.
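To make the effect-size arithmetic concrete, the following minimal Python sketch shows one common way of computing a standardised effect size (Hedges' g) and of reading statistical significance from a 95% confidence interval. How g was derived from the multilevel models is not stated in this section, so the function names and the illustrative inputs below are hypothetical; only the values g = 0.1 and g = 0.27 come from the text.

```python
def hedges_g(mean_diff: float, pooled_sd: float, n_group: int, n_ref: int) -> float:
    """Standardised mean difference with Hedges' small-sample correction.

    mean_diff : adjusted difference between a set and the reference (middle) set
    pooled_sd : pooled standard deviation of the outcome
    n_group, n_ref : numbers of pupils in the two groups
    """
    if pooled_sd <= 0:
        raise ValueError("pooled_sd must be positive")
    df = n_group + n_ref - 2
    if df <= 0:
        raise ValueError("need at least three pupils in total")
    correction = 1.0 - 3.0 / (4.0 * df - 1.0)  # small-sample correction factor
    return (mean_diff / pooled_sd) * correction


def ci95(estimate: float, std_error: float) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval."""
    half_width = 1.96 * std_error
    return estimate - half_width, estimate + half_width


def is_significant(estimate: float, std_error: float) -> bool:
    """True when the 95% interval excludes zero (p < 0.05, two-sided)."""
    lower, upper = ci95(estimate, std_error)
    return lower > 0 or upper < 0


# Ratio of the reported top-set effects (English relative to mathematics).
ratio_english_to_maths = 0.27 / 0.10

# Hypothetical inputs, for illustration only (not values from the study).
example_g = hedges_g(mean_diff=2.0, pooled_sd=8.0, n_group=400, n_ref=600)
```

The ratio of the two reported values is 2.7, which is the sense in which the English effect is "almost three times as large". With samples of this size the small-sample correction is negligible, so g is effectively the adjusted difference divided by the pooled standard deviation.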
In summary, when controlling for prior attainment, pupils in the top set performed significantly better than pupils in the middle and bottom sets in both English and mathematics, and these effects were larger for English than for mathematics. However, our data suggest that pupils placed in the bottom set for English performed slightly worse than pupils placed in the middle set, although this trend was not robustly statistically significant across the models. In other words, our models indicate a widening gap in attainment, but provide more evidence of a relative benefit for pupils placed in top sets compared to all other pupils, rather than a relative disbenefit for those in bottom sets. In addition, our models suggest the effect is larger for English than mathematics.

DISCUSSION
Our study provides up-to-date evidence from a large-scale study in England to show that setting (between-class grouping by subject) is associated with positive impacts on pupils placed in high sets in comparison to those placed in middle and low sets, after controlling for prior attainment. This finding is broadly in line with Ireson et al.'s (2005) now dated results from the early part of this century. In other words, in our study, a pupil who was allocated to a high set tended to make larger gains than a pupil of similar prior attainment who was placed in a middle or low set. Our study provides stronger and more robust evidence for placement in a top set as a key factor in increasing pupil attainment. Additionally, in contrast to Ireson et al., who found similar effect sizes across subjects, we found a much larger effect for English in comparison to mathematics. Before examining the implications of these findings, there are two important points to make. First, our results do not indicate that setting benefits high-attaining pupils. Rather, they show that setting benefits those pupils who are placed in higher sets. There is a great deal of evidence highlighting how pupils are misallocated to high and low sets, and this results in the over-representation of pupils from Black and minority ethnic backgrounds in lower sets (e.g. Connolly et al., 2019) and pupils from socially disadvantaged backgrounds in lower sets (e.g. Kutnick et al., 2005). Hence, in benefitting pupils allocated to top sets, setting disadvantages those pupils misallocated to middle or low sets. Second, our results indicate a relative advantage for pupils placed in top sets, but they do not show that these pupils performed better than they would otherwise have done in a class of mixed attainment.
These findings are of concern from educational and social justice perspectives. They illustrate a growing attainment gap, and a divergence between top-set pupils and pupils in middle and bottom sets. This self-fulfilling prophecy (Merton, 1948) affecting attainment and pupil self-confidence (Francis, Craig et al., 2020; Francis, Taylor et al., 2020) may be due to a Pygmalion effect (Rosenthal & Jacobson, 1992), specifically for those pupils assigned to top sets, who receive more teacher encouragement and higher expectations (cf. Wang et al., 2021). Alternatively, it may be that pupils in top sets are offered a richer curriculum with much greater opportunity to learn (Burris et al., 2006). Or it may be that top sets are allocated better qualified and more experienced teachers (Francis et al., 2019).

FIGURE 2 Post-test mean gains (with 95% confidence intervals) in attainment by set level (compared to middle set) for all four models in English.

This widening gap is of concern to educationalists, because it represents a failure to promote the educational thriving and effective learning that all educational professionals intend for pupils. It is also of concern to policymakers. The United Kingdom is famously dogged by a 'long tail' of underachievement (Marshall, 2013), and our findings provide a clear potential explanation, given the prevalence of within-school tracking in our system (Taylor et al., 2020). Moreover, our findings also highlight that, in spite of the envisaged equality of entitlement to high-quality educational provision facilitated by comprehensive state education, provision is inequitable, with some pupils advantaged and others disadvantaged.
But our findings also have implications for interventions directed at addressing disadvantage in education. For pupils placed in top sets, the effect sizes that we found are of the order of, and for English larger than, those identified in most educational trials (see e.g. Cheung & Slavin, 2016). In addition, the effect sizes for low set placement in English, whilst not statistically significant or consistent across all three models, were nevertheless negative and at least of the order of those identified in most educational trials. In mathematics, the effect sizes for low set placement were small, but nevertheless negative. As we have highlighted, socially disadvantaged pupils (and those from certain minority ethnic groups) are over-represented in, and often misallocated to, lower sets (Connolly et al., 2019). And yet, as we noted earlier, many schools use attainment grouping as one element of a strategy to address educational disadvantage (Macleod et al., 2015). Our results suggest that, especially in English, this may be at best counter-productive and that, despite the best efforts of schools, the effects of attainment grouping may counteract the effects of genuinely beneficial interventions.
The finding that setting has larger effects on pupil outcomes (positive and negative) in English also suggests that: (a) there may be different impacts of setting for different curriculum subject areas, demanding further research in this area; and (b) schools concerned with equity should review setting in English. Interestingly, in England, setting is somewhat less prevalent in English than in mathematics.
There are three limitations with our study. First, there was no control group in which a different form of grouping practice, such as mixed attainment, was used. Hence, we cannot be certain whether the effects on attainment are either caused or exacerbated by setting, nor can we say whether setting resulted in higher attainment for those placed in top sets than would otherwise have been the case. Nevertheless, our findings are in line with much of the previous literature in that they do strongly suggest that attainment grouping is associated with a widening attainment gap, which is due to a relative, but not necessarily an absolute, advantage for those pupils placed in top sets. Second, there was significant attrition of schools from the study, and the remaining schools could be atypical and committed to good practice and equity, given (a) their original voluntary participation in a study focused on best practice in setting and (b) their dedicated completion of the 2-year period of study. Nevertheless, they reflect a national sample, and any atypicality as 'conscientious schools' might be postulated to have mitigated the trends identified, rather than exacerbating them. Third, no measures of teaching quality or opportunity to learn were applied and, hence, we cannot say whether the observed effects are due to setting per se or are a result of the effect of setting on teaching quality or opportunity to access curriculum content.
Finally, our results highlight important issues for further research into the effects of setting in different subjects. There is also an urgent need for more robust research into the effects of setting as compared to mixed-attainment grouping and to investigate the relationships between setting, teacher quality, opportunity to learn and attainment. Despite 100 years of research into the effects of ability grouping, the evidence is still inconclusive. It is clear that research in this area is technically, methodologically and practically difficult. Previous studies highlight some of these difficulties. Ideally, one would carry out an RCT comparing pupils from schools randomly assigned into groups with setting or mixed-attainment classes. This is simply not feasible at scale, because the effort, and time, needed to effect a change in attainment grouping across a school is considerable (Taylor et al., 2019). However, naturalistic studies are also problematic. For example, in Betts and Shkolnik's (2000) comparison of schools with and without a school policy to group pupils by 'ability', classes amongst the no-grouping schools were no less stratified than in those schools with a grouping policy. In other words, despite the official policy, informal grouping by attainment was used. In a new study with which some of the authors are engaged (Hodgen et al., 2019), we seek to approach this methodological challenge by comparing carefully selected and robustly matched samples of schools already using different forms of grouping and examining the effects of teacher quality and opportunity to learn. We hope that this study will help answer this important outstanding question.

ACKNOWLEDGEMENTS
We gratefully acknowledge the following for their assistance throughout the project, without which it would not have been possible: the evaluation team at the National Foundation for Educational Research, particularly Ben Styles and Palak Roy; the Department for Education (DfE) National Pupil Database (NPD) team; the EEF staff, including Anneka Dawson and Emily Yeomans; and, finally, the schools, teachers and pupils who took part in the research. We are also grateful to Jake Anders for advice and comments. The analysis for this paper was conducted whilst Antonina Tereshchenko was at UCL and Paul Connolly was at Lancaster University.

FUNDING INFORMATION
This research was funded by the Education Endowment Foundation (EEF).

CONFLICT OF INTEREST
The authors have no conflict of interest to disclose. Note that, since this research was conducted, Becky Francis has become Chief Executive of the Education Endowment Foundation.

DATA AVAILABILITY STATEMENT
The data are not publicly available due to privacy or ethical restrictions. This is due to recent changes in how National Pupil Database extracts can be shared.

ETHICS STATEMENT
The study was approved by the Research Ethics Committee of King's College London and Queen's University Belfast and, later, UCL Institute of Education (now IOE - Faculty of Education and Society, University College London).

FIGURE 1 Post-test mean gains (with 95% confidence intervals) in attainment by set level (compared to middle set) for all four models in mathematics.

TABLE 2 Summary of three multilevel models used to compare post-test attainment by set level for English. Statistically significant results (p < 0.05) indicated in bold.

TABLE 3 Summary of three multilevel models used to compare post-test attainment by set level for mathematics. Statistically significant results (p < 0.05) indicated in bold.

TABLE 5 Summary of effects on attainment by the bottom set for all four models (including imputation) in both subjects: mathematics and English. Statistically significant results (p < 0.05) indicated in bold.