PISA 2018 in England, Northern Ireland, Scotland and Wales. Is the data really representative of all four corners of the UK?

John Jerrim, UCL Social Research Institute, April 2021

PISA is an influential international study of 15-year-olds' achievement. It has a high profile across the devolved nations of the UK, with the results having a substantial impact upon education policy. Yet many of the technical details underpinning PISA remain poorly understood, particularly amongst non-specialists, including important nuances surrounding the representativeness of the data. This paper provides new evidence on this issue, based upon a case study of PISA 2018. I illustrate how there are many anomalies with the data, with the combination of non-response, exclusions from the test and technical details surrounding eligibility criteria leading to total non-participation rates of around 40% (amongst the highest anywhere in the world). It is then shown how this leads to substantial uncertainty surrounding the PISA results, with clear evidence of bias in the sample for certain parts of the UK. I conclude by discussing how more transparent reporting of the technical details underpinning PISA is needed, at both a national and international level.


Introduction
PISA is an influential international study of 15-year-olds' skills in reading, science and mathematics. Conducted every three years since 2000, it has become a widely-watched indicator of national educational performance across the globe. Results from PISA have had substantial real-world impact upon education policy (Baird et al 2011). This includes reforms made to national curricula in South Korea and Mexico, along with alterations to national assessments in Slovakia and Japan (Breakspear 2012). Policy recommendations made by the OECD off the back of the PISA results have also been influential in Wales (OECD 2014) and a wide range of middle-income countries (Lockhead, Prokic-Bruer and Shadrova 2015), along with many other international examples. It is now one of the most influential studies in education, with the triennial results impacting upon the thoughts and actions of key decision makers all around the world.
PISA has also had a notable impact upon education discussion and debates in the United Kingdom, the country of focus in this paper. Since devolution in the late 1990s, education policies, practices and qualifications have diverged across England, Northern Ireland, Scotland and Wales. This has led to questions about how the four nations of the UK compare in terms of young people's educational achievement, and how this has changed over time (Machin, McNally and Wyness 2013). With few other comparable sources of data available, PISA has become the "go-to" resource to conduct comparisons of educational achievement across the UK. Indeed, national reporting of each new round of PISA has an entire chapter devoted to intra-UK comparisons of educational performance (Sizmur et al 2019), with these results then widely reported within the national media (Coughlan 2019). For reference, the latest trends in PISA mathematics scores can be found in Figure 1.

<< Figure 1 >>
Given PISA's now prominent role in our understanding of educational performance across the UK, it is vital that it provides sound and reliable evidence upon which such comparisons can be made. Yet some have questioned various aspects surrounding the reliability of the PISA study, both within the UK and internationally. For instance, investigating trends in England's PISA scores over time, Jerrim (2013) noted how many important changes were made between PISA 2000/2003 and subsequent rounds, affecting both response rates and test month, which may then have impacted upon the results. In the case of Turkey, Spaull (2019) noted how nuances surrounding the PISA eligibility criteria are likely to have a big impact upon the reliability of trends in PISA scores over time. Anders et al (2020) conducted a detailed case study of the PISA 2015 data for Canada, illustrating how a combination of low response rates and high exclusions led to serious questions surrounding the representativeness of the data. In the case of Portugal, Pereira (2011) argued that changes to how the sample was drawn had a substantial impact upon changes in the PISA scores over time. Also in Portugal, Freitas et al (2016) found there to be non-trivial differences between the PISA target population and the final sample, with this then having a notable impact upon changes in PISA scores. Concerns have also been raised regarding the switch between paper and computer assessment in PISA that occurred in 2015 (Jerrim 2016; Jerrim et al 2018) and how this may have affected comparisons of results across countries and over time. Other issues have been raised about the lack of transparency over the item-response theory model used to generate the PISA scores, including a lack of transparency about how the so-called "plausible values" are produced (Goldstein 2017).
For instance, Zieger et al (2020) illustrated how subtle changes made to the PISA scaling model can have a big impact upon cross-national comparisons of educational inequality. More generally, questions have been raised surrounding cross-country differences in translation and interpretation of the PISA test material (El Masri, Baird and Graesser 2016; Kankaraš and Moors 2014).
Much of the aforementioned work serves as a platform for this paper. After the UK was disqualified from PISA in 2003, due to its low response rate, significant efforts were made to ensure that data collected in future waves would be more robust. This included moving the test date in England, Wales and Northern Ireland to avoid clashes with GCSE preparation, the appointment of external contractors to collect the data rather than a government department, and introducing legislation that could potentially force schools to participate if they did not willingly comply. Taken at face value, it may seem that this strategy has been successful; no part of the UK has been excluded from PISA since 2003, with the data always being considered by the OECD to be of acceptable quality.
However, in reality, the true situation remains much more nuanced than first meets the eye. As this paper will explain, there continues to be a large amount of non-participation by pupils and schools in PISA across the UK, leading to potential biases in the data. I pay particular attention to the PISA 2018 data for Scotland, where particular complications and anomalies have emerged. In summary, my discussion will illustrate how:

- The non-participation rate in PISA across the UK (and in Scotland as an individual entity) is around 40%. This is amongst the highest anywhere in the world.
- This high level of non-participation means that there is a large amount of uncertainty surrounding the UK's PISA results. This is likely to affect the reliability of comparisons that can be made across the four UK nations, comparisons to other countries and how results have changed over time.
- Key issues regarding data quality and comparability have, in my view, not been adequately reported, with greater transparency needed in future rounds.
- There is clear evidence of an upward bias in the PISA 2018 data for England and Wales, with lower-achievers systematically excluded from the sample.
- Average PISA scores in England and Wales would likely be around 10 to 15 points lower, had a truly representative sample of the population taken the tests.
The main aim of this investigation is to help a broader group of interested stakeholders understand such key issues, and to aid their interpretation of the PISA data for the UK. Yet it also highlights a need for better reporting practices of the PISA results in the future, both within the UK and by the OECD. I thus conclude by calling upon the UK Statistics Authority to conduct a review of the PISA 2018 data for the UK, and for them to issue some "best practice" guidelines for the reporting of data from future PISA waves.
The paper now proceeds as follows. Section 2 provides background to the design of PISA and how it is implemented across the UK. Section 3 focuses upon issues with the PISA 2018 data for Scotland, while section 4 provides an analogous discussion for England, Northern Ireland and Wales. Conclusions and recommendations for future PISA data collections follow in section 5.

Background
The OECD, which leads the PISA study, treats the UK as a single country (Sizmur et al 2019a).
This means that it is the data for the UK as a whole that is subject to the OECD's 'technical standards'. Out of the four UK nations, only Scotland participates in PISA as an 'adjudicated' sub-national entity (i.e. a fully-fledged stand-alone participant). This means that additional technical details are reported for Scotland in the annexes to the OECD's PISA technical reports (e.g. OECD 2019). As a sub-national entity, Scotland is also held accountable (as an individual entity) to the OECD's technical standards.
Although England, Wales and Northern Ireland do not participate in PISA as adjudicated subnational entities (and are not individually judged against the OECD's technical standards), they do draw an oversample of schools to facilitate national reporting. Each of the four UK nations thus produces its own national analyses (Sizmur et al 2019a-2019c; Scottish Government 2019), with separate figures for England, Wales, Northern Ireland and Scotland reported on the PISA results day. Exactly how the UK (and its four constituent nations) participates in PISA is therefore somewhat more complicated than first meets the eye.
Target population

PISA is widely interpreted as a measure of 15-year-olds' skills in science, reading and mathematics. However, the actual target population is somewhat more nuanced, defined as "students aged between 15 years and 3 (completed) months and 16 years and 2 (completed) months at the beginning of the testing period, attending educational institutions located within the adjudicated entity, and in grade 7 or higher" (OECD 2019: Annex I). As noted previously (Spaull 2019), the specifics underpinning this definition, particularly the focus upon pupils who are enrolled in school, have some important implications. In particular, it is likely to inflate PISA scores in countries where a non-trivial proportion of 15-year-olds are not enrolled in school (mainly lower and middle-income settings).

Sampling frame and school-level exclusions
With the target population in hand, a sampling frame is constructed: essentially a list of all schools within a country that include 15-year-old pupils. However, from this sampling frame, countries are permitted to exclude some schools due to either logistical reasons or where there is an expectation that most pupils would not be eligible to participate (OECD 2019). For example, in England, special schools, hospital schools, secure units, international immersion schools and pupil referral units were excluded on this basis (Sizmur et al 2019a). The OECD data quality standards stipulate that a maximum of 2.5% of schools can be excluded for such reasons (OECD 2019: Annex I) 1 , with the PISA 2018 data for the UK within this limit (2.2% for the UK and 1.7% for Scotland). Nevertheless, any such school-level exclusions made by a country could contribute to the PISA data becoming unrepresentative of the target population.

School sampling
After excluding a small number of schools, those remaining on the sampling frame within each country are "stratified" into different groups (known as explicit stratification). The precise stratification variables used within each of the four UK nations differs (see Appendix A for details) but typically include some combination of broad geographic region and school type.
Then, within these explicit strata, schools are ranked/ordered by a set of further characteristics (known as implicit stratification). The most important stratification variable used in England, Scotland and Wales is historic performance in national examinations (e.g. Attainment 8 scores in England) 2 . Within each of the four UK countries, and within each explicit stratum, schools are then sampled with probability proportional to size.
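For intuition, the selection step can be sketched as systematic probability-proportional-to-size (PPS) sampling over a sorted frame. This is an illustrative simplification (the field names, and the handling of schools larger than the sampling interval, are my own assumptions), not the actual sampling code used by the PISA contractors:

```python
import random

def pps_systematic_sample(frame, n):
    """Systematic PPS sampling over a school frame (a list of dicts with
    'name' and 'size', assumed pre-sorted by the implicit stratification
    variables). Each school's selection probability is proportional to
    'size' (its number of enrolled 15-year-olds). Simplification: a school
    larger than the sampling interval could be selected more than once."""
    total = sum(s["size"] for s in frame)
    interval = total / n
    start = random.uniform(0, interval)
    targets = iter(start + i * interval for i in range(n))
    selected, cumulative = [], 0
    t = next(targets, None)
    for school in frame:
        cumulative += school["size"]
        # select every school whose cumulative size range contains a target point
        while t is not None and t <= cumulative:
            selected.append(school["name"])
            t = next(targets, None)
    return selected

random.seed(1)
frame = [{"name": f"school_{i}", "size": 50 + 10 * i} for i in range(40)]
sample = pps_systematic_sample(frame, 8)
print(len(sample))  # 8
```

Because schools are sorted by historic exam performance before the systematic selection, the sample is implicitly balanced on that variable, which is also what makes adjacent schools plausible replacements (see the next sub-section).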

School non-response
As with any study, not all schools that are asked to participate in PISA agree; there is a problem of school non-response. A somewhat unusual feature of PISA is that, if a school refuses to participate, a substitute/replacement can take its place. Specifically, for each school initially sampled, two possible replacements are also selected at the same time. These are typically schools that are adjacent to the originally sampled schools on the sampling frame, and should thus be similar in terms of historic school performance on national examinations (at least in England, Wales and Scotland, where this information is used in the stratification of the sample).
In reality, this approach to school non-response is a form of imputation, with an implicit Missing At Random (MAR) assumption being made.
The OECD set criteria for the level of school non-response they deem 'acceptable', as illustrated by Figure 2. The aim is for each country to successfully recruit 85% of originally sampled schools (before any replacements are included), with the vast majority achieving this in PISA 2018. In contrast, if fewer than 65% of originally sampled schools participate, then the data for the country should be considered of unacceptable quality and excluded from the PISA results (although, in reality, even some countries that fail to reach this "minimum" benchmark are not excluded by the OECD; see Anders et al 2020). If a country achieves between a 65% and 85% response rate amongst initially sampled schools, then replacement schools can be included to meet the OECD's school response rate criteria. However, the after-replacement school response rate target also increases. For instance, if a country has a 70% response rate amongst initially sampled schools, it would need to achieve a 93% school response rate after the replacements are included in order to fulfil the OECD's technical standards (Sizmur et al 2019). If it fails to do so, then the country must produce a school-level non-response bias analysis (NRBA) to demonstrate whether there is any bias in the final school sample. This NRBA is adjudicated by the OECD to decide whether to include the country's data in the PISA results. However, as noted by Anders et al (2020), this adjudication process is actually quite weak, with only 3 out of 23 instances of an NRBA leading to a country being excluded between PISA 2000 and 2015.
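The sliding scale can be approximated mechanically. The OECD's actual boundary is published graphically in its technical reports; the sketch below simply assumes a linear boundary from (65%, 95%) to (85%, 85%), where the 95% endpoint is my own reading of the published figure, not a quoted standard. Under that assumption it reproduces the 70% → 93% example to within rounding:

```python
def required_after_replacement(before_pct):
    """Minimum after-replacement school response rate (%) needed, on an
    ASSUMED linear boundary from (65, 95) to (85, 85). Illustrative only:
    the official boundary is defined graphically in the PISA technical
    report. Returns None where no level of replacement suffices."""
    if before_pct < 65:
        return None        # below 65%: data deemed 'not acceptable' outright
    if before_pct >= 85:
        return before_pct  # target already met before replacements
    # linear interpolation between the two assumed anchor points
    return 95 - (before_pct - 65) * 0.5

print(required_after_replacement(70))  # 92.5, i.e. roughly the 93% quoted above
```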

<< Figure 2 >>
Importantly, the PISA 2018 data for the UK "did not fully meet the PISA 2018 participation requirement" (Sizmur et al 2019) due to high levels of school non-response, as illustrated by Figure 2. School non-response was a particular problem in England (a 72% before-replacement school response rate) and Northern Ireland (66%), while Wales and Scotland met the OECD's criteria. The OECD thus required the UK (as a whole, single entity) to produce a non-response bias analysis. The OECD's technical group judged "that no notable bias would result from the [school] non-response" (OECD 2019: Chapter 14). I return to this point in section 4, when discussing issues with the PISA 2018 data for England and Northern Ireland.
Within-school sampling of pupils

All schools that agree to participate in PISA are asked to provide a list of all pupils who meet the definition of the PISA target population (i.e. pupils aged between 15 years 3 months and 16 years 2 months) at the time the assessment is due to take place. Using these lists, 40 pupils are randomly selected from within each school to participate in PISA. Note that the age-based definition used in PISA means that the pupils selected may fall across multiple school year groups (an important point that shall be returned to when discussing anomalies with the Scottish PISA data in the section that follows).
However, not all of these age-eligible pupils who have been selected to take the PISA test will actually sit the assessment. We term this "non-participation" in this paper, noting that this can occur for three reasons.
The first is "within-school" exclusions, meaning that schools can decide not to test some of the sampled pupils. The OECD technical standards state that such within-school exclusions should total less than 2.5% of the PISA desired target population, and that the combination of school-level and within-school exclusions should not exceed 5% of the target population (OECD 2019: Annex I).
The second is ineligibility: pupils who were included on the age-eligible pupil list, but who were then considered not to meet the definition of the target population. Importantly, this "ineligible" category includes pupils who left the school between the time the sample was drawn and the time the test was conducted.

Finally, there is the issue of pupil non-response. The remaining pupils (or their parents) may not consent to take part in the study, or pupils may be absent on the day of the test. The OECD technical standards stipulate that such pupil non-response must not be greater than 20%, otherwise a pupil-level non-response bias analysis will need to be conducted. In reality, almost all countries meet these standards (e.g. in PISA 2018, out of the 80 participating countries, just one, Portugal, did not meet this threshold).

A note about weights
The OECD database includes a set of weights. Given the issues discussed above, it is important to understand what these weights achieve, and the implications for analysis of PISA data for the UK.
The first key function of these weights is to correct estimates for unequal probabilities of schools being selected into the PISA sample (in part due to the oversampling that occurs across the UK). This element of the weights also scales figures up to the UK population. A key implication is that all figures reported for the UK by the OECD are driven by the data for England, given that this country accounts for 84% of 15-year-olds who live in the UK (Sizmur et al 2019c). This, in turn, also means that the UK-wide figures reported by the OECD serve as a close proxy for the results for England as a stand-alone country. On the other hand, the UK-wide figures almost completely mask the situation in Scotland, Northern Ireland and Wales.
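The dominance of England in any UK-wide figure follows from simple weighted-average arithmetic. In the sketch below, only England's 84% population share comes from the text; the other nations' shares and all of the mean scores are hypothetical, purely for illustration:

```python
# Hypothetical mean scores and population shares. Only England's 84% share
# is taken from the text (Sizmur et al 2019c); everything else is invented
# solely to illustrate the weighting arithmetic.
shares = {"England": 0.84, "Scotland": 0.06, "Wales": 0.06, "Northern Ireland": 0.04}
means = {"England": 500, "Scotland": 490, "Wales": 485, "Northern Ireland": 490}

uk_mean = sum(shares[n] * means[n] for n in shares)
gap_to_england = means["England"] - uk_mean
print(round(uk_mean, 1), round(gap_to_england, 1))  # 498.1 1.9
```

Even with the other three nations scoring 10 to 15 points below England in this toy example, the population-weighted UK figure sits within about two points of England's mean, illustrating why the UK-wide results act as a close proxy for England alone.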
The second key role of these weights is to make some limited adjustment for non-response. Specifically, the weights use school-level data in the form of the stratification variables (see Appendix A), along with some very basic pupil characteristics (gender and grade), to try and account for school and pupil non-response. As the stratification variables for England, Wales and Scotland include measures of historical school performance in national examinations, these weights may do a reasonable job of correcting for school non-response (see Micklewright, Schnepf and Skinner 2012 for some empirical evidence on this issue). On the other hand, as argued by Anders et al (2020), the weights provided are highly unlikely to reduce bias due to pupil non-participation, given the very limited amount of pupil-level data (just gender and grade) included in their construction. Moreover, previous work has suggested that it is non-response amongst pupils, rather than by schools, that drove bias in the PISA data for England in PISA 2000 and 2003 (Micklewright, Schnepf and Skinner 2012). Thus, in reality, the weighting scheme used within PISA is unlikely to solve potential bias induced by the various ways pupils drop out of the study (particularly when these are not due to school non-response).

Test month
A final unusual feature of PISA in the UK is when the assessment takes place. In most Northern Hemisphere countries, PISA is conducted between March and August. However, since 2006, England, Wales and Northern Ireland have received special dispensation from the OECD to conduct PISA between October and December, so as to avoid conflicts with GCSE examinations 3 . One important implication of this is that almost all pupils who sit the PISA test in England (97%), Wales (98%) and Northern Ireland (92%) are in the equivalent of Year 11.
In Scotland, the situation has been different. Up until 2015, the PISA test was conducted between March and May. This changed, however, in 2018, when the test period moved to between October and December. The reason behind making this change has not, to my knowledge, been documented in either the Scottish or OECD reporting of the PISA results.
Yet, as I will discuss in the next section, it may have important implications for interpretation of the PISA data for Scotland.

Summary
The above outlines key aspects of how the PISA data is collected, with a summary of this complex process provided in Figure 3. This documents how there are many channels via which the final PISA sample may become unrepresentative of the population of 15-year-olds in a given country, including school/pupil exclusions, non-response and important nuances that emerge via the eligibility criteria. In the following section, I discuss how these factors accumulate in a case study of the PISA 2018 data for Scotland.

Anomalies in the PISA 2018 data for Scotland
High levels of pupil exclusions

The first issue to highlight with the Scottish PISA data, and indeed with the data for the UK as a whole, is the comparatively high rate of pupil exclusions. This is illustrated in Figure 4, with the pupil-exclusion rate plotted along the horizontal axis and the total exclusion rate (encompassing both pupil- and school-level exclusions) plotted along the vertical axis. The dashed lines represent the cut-off thresholds for the maximum level of such exclusions permitted by the OECD technical standards.

<< Figure 4 >>
There are two key points to note. First, Scotland (as well as the UK overall) narrowly failed to meet the PISA technical standards on both of these exclusion criteria. Specifically, within-school exclusions totalled 3.8% of the population in Scotland (3.3% for the UK as a whole), compared to a guideline maximum of 2.5%. Likewise, Scotland's (5.4%) and the UK's (5.5%) total exclusion rates also surpassed the 5% maximum specified in the PISA technical standards (OECD 2019: Annex I, standard 1.7). In other words, strict application of this aspect of PISA's data quality criteria would have led Scotland, and indeed the whole UK, to be removed from the study.
Second, these exclusion rates for Scotland and the UK are higher than in most other countries.
Although Scotland and the UK are clearly not alone in violating the OECD's technical standards, the average within-school exclusion rate across all participating countries is substantially lower than in the UK (standing at 1.4%) as is the total exclusion rate (3.0%).
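As a simple illustration, the two thresholds can be checked mechanically using the figures quoted above. The function below is a sketch of the comparison, not an official OECD calculation:

```python
def check_exclusions(within_school_pct, total_pct):
    """Compare exclusion rates against the OECD maxima quoted in the text:
    2.5% for within-school exclusions and 5% for total exclusions
    (OECD 2019: Annex I, standard 1.7)."""
    return {"within_ok": within_school_pct <= 2.5,
            "total_ok": total_pct <= 5.0}

print(check_exclusions(3.8, 5.4))  # Scotland 2018: both thresholds breached
print(check_exclusions(1.4, 3.0))  # all-country averages: both thresholds met
```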
Why is this likely to be important? Such pupil-level exclusions typically occur due to issues surrounding special educational needs, or because pupils are recent immigrants into a country with limited language skills. These are hence pupils who would likely obtain comparatively low scores were the PISA test accessible to them. Yet, if some countries (e.g. Scotland) are more likely to exclude such children from the sample than other countries (e.g. Japan or South Korea, where the pupil-exclusion rate totals less than 0.1%), then this is likely to introduce bias into cross-country comparisons of their PISA performance. Indeed, the OECD technical standards on exclusions are designed to limit the potential bias in the mean score from such exclusions to around five PISA test points (Rutkowski and Rutkowski 2016). In other words, the exclusion rates observed for Scotland (and the UK as a whole) could alone lead to a non-trivial five-point decline in average PISA scores.

The change of test date

As noted in section 2, the timing of the PISA test in Scotland moved in 2018 from spring to autumn. Although this may seem an innocuous change at first, it has some potentially important impacts upon the Scottish PISA sample. These stem from the precise definition of the PISA target population: those aged between 15 years and 3 (completed) months and 16 years and 2 (completed) months at the beginning of the testing period. In other words, by altering the test dates, Scotland may have changed the composition of pupils being tested. This point is illustrated by Table 1, which presents the percentage of the Scottish PISA sample in the S4 and S5 year groups by survey round. Up until 2015, the vast majority (almost 90%) of the Scottish PISA sample were enrolled in S4, with only a small minority (just over 10%) enrolled in S5. Yet, due to the change in the test date, in 2018 there was an even split of the Scottish PISA sample across these two year groups (50% in S4 and 50% in S5). Of course, young people in a later school year group (S5) may well have a different distribution of academic skills than those in an earlier year group (S4).
Moreover, it is also likely that this date change, and the potential change in the composition of exactly who is being tested, may impact upon measures of educational inequality. Unfortunately, I know of no work by either the Scottish government or the OECD to try and quantify the potential impact of this important change upon Scotland's PISA results.

<< Table 1 >>
Perhaps the most unfortunate aspect of this key change is the lack of transparency with which it (and its potential implications) has been reported. First, the PISA 2018 report for Scotland notes that the tests were conducted between October and December, but without any mention of how this differed from previous years. Second, to my knowledge, no justification has been presented as to why the test date was changed. Third, the PISA 2018 report for Scotland includes a whole section discussing issues with interpreting trends in PISA data over time, but completely fails to recognise this key issue.
Finally, the methodology sections of the PISA 2012, 2015 and 2018 reports for Scotland are almost identical (copied almost word for word). Importantly, the 2015 report clearly states that "students were mostly (87.5%) in the S4 year group" (Scottish Government 2015: 10), with a similar statement in the 2012 report: "students were mostly in the S4 year group" (Scottish Government 2012: 6). Yet no such statement is made in the 2018 edition. In other words, this key piece of text has been selectively removed from the Scottish PISA report in 2018, within what is otherwise an almost identical passage of text. This is despite the fact that this information is now more relevant than ever, given the change of test date.

High rates of pupil ineligibility / withdrawal
A further issue that may be related to the change of test month is documented in Figure 5. This plots the percentage of 'ineligible'/'withdrawn' pupils in PISA 2015 (vertical axis) against PISA 2018 (horizontal axis) by country. Note how Scotland is a clear outlier in two ways. First, the percentage of ineligible pupils in Scotland in 2018 (9.3%) is much higher than in any other country (OECD average = 1.6%; all-country average = 1.7%). Second, the percentage of ineligible/withdrawn pupils more than doubled in Scotland between 2015 (4.1%) and 2018 (9.3%). In comparison, in most other countries the figures have remained broadly stable at a much lower level.

<< Figure 5 >>
To my knowledge, this issue has not been commented upon anywhere by either the OECD or the Scottish government. There hence seems no 'official' explanation for why it has occurred.
Here, I offer what I believe to be the most likely explanation.
To begin, recall that ineligible pupils are identified after the pupil sample has been drawn within participating schools. Then, according to the PISA 2018 report for Scotland, pupils may be classified as ineligible if they had left the school (between when the PISA sample was drawn and when the test was conducted): "Students that had left the school in the interim were not considered part of the target sample" (Scottish Government 2019:10).
It hence seems that the high "ineligibility" rate for Scotland in PISA 2018 is being driven by an unusually large number of pupils leaving the school between when the sampling is done and the PISA test window.
Why might this occur? One possibility is that this is related to when national examinations ("nationals") take place in Scotland, and young people's subsequent educational pathways.
Specifically, young people in Scotland take their "nationals" at the end of S4. Then, after completing S4, young people may decide to change schools; for instance, to move to a further education college to pursue a more vocational education pathway. Now recall from the sub-section above how almost 90% of the Scottish PISA sample were in S4 in PISA 2015. This means that the vast majority of pupils in Scotland had not yet taken their "nationals", and hence were likely to be in the same school at the time the sampling was done and the time the PISA test was conducted. This changed, however, in PISA 2018 with the movement of the test date, with PISA now spanning S4 (pre-nationals) and S5 (post-nationals) equally. Consequently, there may now be many more pupils in the PISA sample who have left their school after taking their nationals, thus leading to the high and rising levels of "ineligibility" observed in Scotland.
Importantly, those pupils who change schools between S4 and S5 are probably lower-achievers; school mobility has previously been linked with lower levels of achievement (Strand and Demie 2007), while young people who pursue vocational courses tend to have, on average, lower levels of academic achievement. In other words, the high levels of pupil "ineligibility" for Scotland in PISA 2018 may have led to Scotland removing some lower-achieving pupils from the sample.
Unfortunately, without any further detail available on what exactly is driving the high ineligibility rate in Scotland, it is difficult to say for certain why it has occurred, or to fully appreciate its consequences. To try and find out more, I made a freedom of information request to the Scottish government, with the full list of questions asked and the responses provided available from https://www.whatdotheyknow.com/request/720228/response/1725609/attach/3/Response%20202100141438.pdf?cookie_passthrough=1. In this, the Scottish government confirmed that the high ineligibility figure in Scotland is "likely to reflect the change in the timing of the PISA assessments in Scotland", with the PISA pupil lists provided during the school summer holidays and before the census at the start of the new academic year. They have hence now confirmed the explanation that I offered above: that the high level of ineligibility has been driven by pupils moving between schools, most likely between S4 and S5. Yet they also go on to note that they are unable to precisely quantify the extent of this problem, because they "do not hold information on how many of the ineligible students had left school between the sampling and the assessment dates".

Low pupil response rates
As noted previously, the OECD's technical standards require that "the final weighted student response rate is at least 80% of all sampled students across responding schools" (OECD 2019: technical standard 1.11). An important caveat, however, is that within-school exclusions and pupils deemed ineligible (as outlined in the sub-sections above) are not counted in these figures. Likewise, pupils within schools with low levels of participation are also not included in the official pupil response rate calculation. Thus, in reality, the technical standard applied is not 80% of all sampled pupils. Rather, it is 80% of those who were sampled, were not already excluded by their schools (due to, for instance, special educational needs) and are in schools where pupil participation rates exceed 50% (explained in more detail below).
Nevertheless, Figure 6 illustrates how each country performed against this technical standard in PISA 2018. From this, there are three key points to note. First, only one (Portugal) of the 80 participating countries failed to reach the 80% threshold. This could either be seen as a triumph of PISA in encouraging pupils to respond or, as Anders et al (2020) argue, a sign that the 80% response rate threshold is too low, and not a sufficiently robust criterion to inspire confidence in the representativeness of the sample. Second, other than Portugal, Scotland had the lowest pupil response rate of any participating country (80.5%). Finally, the pupil response rate was also low in other parts of the UK, with the figures for England (83.2%), Wales (85.5%) and Northern Ireland (83.7%) each below the OECD average (90%).

<< Figure 6 >>
There are, however, some questions as to whether the true pupil response rate in Scotland is even lower, and whether it has only apparently managed to (just) reach this technical standard due to a subtle technicality in how the pupil response rate has been calculated. In particular, the PISA 2018 report for Scotland notes: "In total, 3,767 students were deemed eligible to take part". 4 It then states: "Of these, a total of 2,969 students took part". This would hence give an unweighted response rate of 78.8% for Scotland, falling just below the 80% threshold 5 .
How has this discrepancy occurred? Using the figures provided in the PISA 2018 report for Scotland, it can be traced to 80 pupils whose status has not been accounted for.
It seems that these 80 pupils belong to schools with particularly low pupil response rates 6 . In PISA, individual schools where only 25-50% of sampled pupils complete the test are incorporated into the school non-response figures, not the pupil response rate, despite these schools/pupils being included in the final database (and thus contributing to Scotland's PISA scores). This has led to the OECD excluding 29 (responding) and 51 (non-responding) pupils from the calculation of the pupil response rate in Scotland, with their two schools included in the school-level non-response calculation instead. I discuss this issue in further detail in Appendix B. In this I note how, if these 80 pupils were included in the pupil non-response figures (which is arguably more appropriate), then Scotland's response rate would be 79.6%, below the 80% threshold.
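The two versions of this calculation can be reproduced directly from the published figures. The sketch below is purely illustrative; the function and variable names are my own, not part of any official PISA tooling:

```python
# Figures from the PISA 2018 report for Scotland and OECD (2019: Table 11.8).
TESTED_OFFICIAL = 2_969      # pupils counted as participating in the official calculation
ELIGIBLE_OFFICIAL = 3_687    # official denominator, after the two low-response schools are removed
LOW_RESPONSE_TESTED = 29     # tested pupils in the two schools with 25-50% participation
LOW_RESPONSE_SAMPLED = 80    # all sampled pupils in those two schools

def response_rate(tested: int, eligible: int) -> float:
    """Unweighted pupil response rate, as a percentage."""
    return 100 * tested / eligible

# Official OECD calculation: the two schools are treated as school-level non-response.
official = response_rate(TESTED_OFFICIAL, ELIGIBLE_OFFICIAL)

# Alternative: count those 80 pupils as pupil-level non-response instead.
alternative = response_rate(TESTED_OFFICIAL + LOW_RESPONSE_TESTED,
                            ELIGIBLE_OFFICIAL + LOW_RESPONSE_SAMPLED)

print(f"Official: {official:.1f}%  Alternative: {alternative:.1f}%")
# Official: 80.5%  Alternative: 79.6%
```

The entire difference between falling just above or just below the 80% threshold thus rests on where these 80 pupils are placed in the accounting.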
It is my view that, given this accounting anomaly, a pupil-level non-response bias analysis should have been conducted by Scotland for its PISA 2018 data; it either fell narrowly above or narrowly below the desired 80% threshold depending on exactly how one chooses to do the non-response calculation. In particular, assuming that pupil non-response is not random (for instance, because low-achieving and disadvantaged 15-year-olds are much more likely to be absent from school than high-achieving, socio-economically advantaged 15-year-olds), the fact that one-in-five did not complete the test has clear potential to bias the PISA results. It thus seems important that the magnitude of such potential bias is investigated and transparently reported, regardless of whether it falls just above or just below this 80% threshold.
Unfortunately, to my knowledge, neither the Scottish government nor the OECD has investigated this issue, or published any such evidence in the public domain.

The overall impact of the above: low coverage of the target population

Thus far I have considered these issues in isolation. Yet what really matters is their cumulative impact. When they are taken together, how far has the PISA sample moved away from the target population?
Evidence is presented on this matter for Scotland and the UK as a whole in Table 2 7 . The first row provides an estimate of the number of 15-year-olds living in Scotland and the UK (drawn from OECD 2019: Chapter 11). Then, moving down the rows, it provides an indication of the reduction from the target population through to the final (weighted) PISA sample, due to all the various issues discussed above (and overviewed in Figure 1). The information in Table 2 has been drawn from OECD (2019: Chapter 11) and provides, in my view, the most comprehensive picture of how the various forms of non-participation (exclusions, "ineligibility", school non-response, pupil non-response) affect the PISA sample.

<< Table 2 >>
The first key point to take from Table 2 is that, for Scotland and the whole UK, almost 40% of the target population gets removed from the (weighted) PISA sample. If this 40% is not a random selection (and, as argued above, there is good reason to believe that they will tend to be lower-achievers) then this large amount of non-participation has clear potential to introduce bias into the results.
Second, importantly, it helps illustrate how the main problem in Scotland, and the UK as a whole, occurs at the pupil level, not at the school level. In other words, a lot more pupils get lost from the target population due to non-participation amongst pupils, rather than non-participation by schools 8 . Take the figure for Scotland, for example. A total of 5,741 (53,398 - 47,657) pupils from the target population are lost due to either school-level exclusions or school non-response. This compares to 14,147 (47,657 - 33,510) due to non-participation amongst pupils.
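This decomposition follows directly from the Table 2 figures for Scotland. A minimal sketch (the variable names are my own):

```python
# Population counts for Scotland, drawn from Table 2 / OECD (2019: Chapter 11).
target_population = 53_398       # estimated number of 15-year-olds in Scotland
after_school_stage = 47_657      # remaining after school-level exclusions and school non-response
final_weighted_sample = 33_510   # final (weighted) PISA sample

lost_at_school_level = target_population - after_school_stage     # losses driven by schools
lost_at_pupil_level = after_school_stage - final_weighted_sample  # losses driven by pupils

total_lost = lost_at_school_level + lost_at_pupil_level
share_lost = 100 * total_lost / target_population

print(f"School level: {lost_at_school_level:,}  "
      f"Pupil level: {lost_at_pupil_level:,}  "
      f"Total lost: {share_lost:.0f}% of target population")
```

Pupil-level losses are more than twice the size of the school-level ones, which is the point the weighting adjustments largely fail to address.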
This is important because almost all the adjustment the OECD makes to attempt to control for selective non-participation occurs at the school level (via the use of replacement schools and non-response adjustments incorporated into the weights). Almost no adjustment is made for non-participation at the pupil level (other than some very basic allowance for differential non-response by grade and gender), despite this being, as Table 2 illustrates, where the major potential problems occur.

To illustrate the potential consequences, I simulate the impact that such non-participation could have upon the PISA results. For simplicity, we implement this simulation using just the first plausible value. Further details about this approach can be found in Anders et al (2020).

<< Table 3 >>
Results from this simulation, focusing upon reading, are presented in Table 4 for Scotland. There is, of course, a large degree of uncertainty surrounding such results, as our simulation results in Table 4 reflect. The bias brought about by such non-participation might be higher or lower, depending upon the exact characteristics of the non-participants who have been selected out. Yet this also clearly illustrates how the large amount of non-participation means that there is quite substantial uncertainty surrounding the UK's PISA results.

England
There are two specific concerns with the PISA 2018 data for England. The first, as illustrated by Figure 2, is the high level of school non-response. In particular, England failed to meet the OECD's technical standards, and was required to conduct a school-level non-response bias analysis. The results are summarised in Table 5b. Comparing column (1) to column (2) reiterates the point made above: non-responding schools had lower levels of prior GCSE performance. Hence there is a greater share of schools in the top three achievement quartiles in column (2) (participating schools from the original sample) than in column (1) (the full original sample).
As noted in section 2, PISA has two ways of trying to deal with such school non-response: (a) allowing replacement schools to take the place of non-responding schools and (b) including a non-response adjustment in the final weights. These are the results presented in columns (3) and (4) respectively, which seemingly bring the figures much closer to those observed for the full original sample in column (1).
At first glance, the similarity between columns (1) and (4) may seem reassuring. But does this really mean that all the potential problems surrounding school non-response have been resolved?
Unfortunately not. In fact, such a comparison is a clever sleight of hand. To understand why, recall from section 2 how replacement schools are selected, and how the PISA weights are constructed. With respect to the former, replacement schools are selected as those adjacent to the non-responding originally sampled school on the sampling frame, which has been implicitly stratified by the historic school performance variable presented in Table 5b. In other words, the inclusion of replacement schools will mechanically improve the comparisons being made, as the variable in question helps to determine which replacement schools get selected. This is because the same variables are being used to adjust for non-response (through the selection of replacement schools) and then also to judge whether this non-response adjustment has "worked".
A similar intuition holds for why applying the weights leads to the improvement in Table 5b; historic school performance (which is being used to judge the likely bias in the sample) has a direct role in how the weights (which are being used to adjust for the likely bias) are constructed. Hence, once the weights are applied, it is unsurprising (and, in fact, mechanical) that the distribution of historic school performance (presented in column 4) moves closer to the distribution for originally sampled schools (presented in column 1). Further discussion is provided on this matter in Appendix C. This point has actually been noted by other countries that have had to perform similar bias analyses, such as the United States, which states how such comparisons: "may provide an overly optimistic scenario, resulting from the fact that substitution and nonresponse adjustments may correct somewhat for deficiencies in the characteristics examined, but there is no guarantee that they are equally as effective for other characteristics and, in particular, for student achievement." (National Center for Education Statistics 2019).
If those responsible (the NFER and Department for Education) really wanted to know about the bias school non-response brought into the PISA sample, they would have conducted a different analysis. Table 5b would have still been produced, but using pupil-level data from the schools for the cohort in question (i.e. pupils in these schools who took their GCSEs in 2019), focusing upon the distribution of Key Stage 2 scores and/or their final GCSE grades 10 . This approach would have two key advantages. First, by using pupil (rather than school) level data, the analysis would have much more power to detect potential differences. Second, it would illustrate potential bias in a key variable (i.e. one that is reasonably strongly associated with PISA scores) that has not been directly used in the selection of replacement schools or in the construction of the response weights. It would not suffer the problem of the same variable (school-level historic GCSE performance) being used both to adjust for non-response and then also to judge whether that non-response adjustment has "worked".
My interpretation of the available evidence on potential bias in the PISA sample for England from school non-response is hence not as optimistic as the views of the OECD or as presented in England's national report. Of course, such matters are never black and white, and are often a matter of judgement and opinion about the evidence available. Yet this helps to reiterate a recurring theme presented throughout this paper. In order for academics and policymakers to come to their own reasoned judgements on such issues, it is vital that the evidence is openly and transparently reported when the PISA results are released, as a matter of course.
Unfortunately this is not currently the case, with little more than a nebulous paragraph about such issues relegated to the annexes of the reports, and no hard data presented to support the claims being made.
The second major issue for the PISA 2018 data for England can be inferred from Tables 2 -4.
Specifically, as England dominates the UK figures (making up 84% of the weighted sample), it becomes clear that there has been significant non-participation in the study in England. This has occurred through various channels, and is not primarily driven by the issue of school non-response discussed above. In particular, as can be inferred from Table 2, England not only had high levels of school non-response, but also high levels of within-school pupil exclusions and pupil non-response. Thus, as can be inferred from Tables 3 and 4, the PISA data for England suffer the same challenges as the data for Scotland, with the various forms of non-participation having a large cumulative impact upon the sample (Table 3), thus meaning that there is quite a high degree of uncertainty over England's PISA scores (Table 4).
Importantly, however, it is possible to investigate potential bias in the PISA sample for England in one additional way. As part of my freedom of information request, I additionally asked for the GCSE grades obtained by the PISA 2018 cohort (these examinations were sat just six months after they took the PISA test) 11 . These can then be compared to the national distribution of GCSE grades for 16-year-olds which, unlike PISA, is based upon data from all Year 11 pupils (and thus does not suffer from issues such as school non-response, pupil exclusions or pupil non-response). The results of this comparison can be found in Table 6 (panel a).

Wales
However, as Wales is not a full participant in PISA (i.e. it is not an "adjudicated sub-region"), little is known about the proportion of excluded or ineligible pupils.
As the PISA 2018 data for Wales has also been linked to pupils' administrative records, it is possible to compare the GCSE grades they achieved to the national grade distribution. This, in turn, can be used to provide some insight into whether bias may have crept into the Welsh PISA sample. The results from this analysis can be found in Table 5. These point towards average scores being biased upwards in Wales, compared to if a truly representative sample from the population had been drawn. This bias is substantial, and further illustrates how caution is needed when interpreting the PISA 2018 data for Wales.

Northern Ireland
As illustrated by Figure 2, school non-response was significantly higher in Northern Ireland than in the rest of the UK. In fact, if just one more originally sampled school had refused to take part, Northern Ireland's before-replacement response rate would have fallen below 65%, and would have been considered "not acceptable" (if judged against the OECD's technical standards) 12 . Moreover, the use of replacement schools (and the non-response adjustment incorporated into the PISA weights) is likely to be a less successful strategy in guarding against non-response bias in Northern Ireland than in England, Wales and Scotland. This is because the stratification variables used in Northern Ireland, which play a key role in PISA's non-response adjustments, do not include any information on historic school performance in GCSE examinations (or equivalent), unlike the rest of the UK (see Appendix A for details).
12 Also, the unweighted response rate was 64.7% -below the 65% threshold.
If Northern Ireland was an adjudicated sub-region in PISA, the OECD would have required a school-level non-response bias analysis to take place. However, as Northern Ireland does not participate in PISA as an independent nation, this was not required by the OECD.
Yet the PISA 2018 report for Northern Ireland clearly states that such a non-response bias analysis did take place (Sizmur et al 2019c). Several important details about this analysis are worth noting:
- The non-response bias analysis for Northern Ireland was not sent to the OECD.
- The analysis was undertaken by the National Foundation for Educational Research (NFER), the contractors for the PISA 2018 study, and was shared just with UK government officials.
- It was hence some combination of UK government officials and the NFER who came to the judgement that the results of the bias analysis were "positive" (though the split of responsibilities remains unclear).
- Critically, there was no outside scrutiny of the non-response bias analysis produced (not even by the OECD).
What about the evidence presented in the non-response bias analysis itself? Was it really as "positive" as claimed in the national report?
The non-response bias analysis for Northern Ireland followed the approach that was used for England, described above. Participating schools were compared to non-participating schools to see if they were similar in terms of observable characteristics. Then the distributions of these characteristics were compared across the original sample and participating schools (both before and after replacement schools were included, and with and without weights applied). I summarise what I believe to be the key figures from this non-response bias analysis in Table 7 13 .

<< Table 7 >>
There are two key points to note. First, very few characteristics of schools have been compared. In particular, the only variables considered are gender (boys-only school, girls-only school, mixed), region and school type. The clearest, and most important, difference from the bias analysis conducted for England is that no information on historic school performance in GCSE examinations is included. Second, the sample size is small (only 102 schools at most), with seemingly much reliance upon whether differences are "statistically significant" or not. There are of course questions about whether such significance tests are even valid in such a context (Gorard 2010). Yet even if one accepts significance tests as a valid approach here, with only around 100 observations, any such tests will be woefully underpowered. In other words, this combination of an investigation of limited characteristics and reliance upon statistical significance means it is almost impossible to detect whether any bias is present or not.
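To get a feel for just how underpowered such tests are, consider a rough normal-approximation power calculation for comparing two proportions. The 70/32 split of the roughly 102 schools and the 15 percentage-point difference below are my own hypothetical assumptions, chosen only to illustrate the order of magnitude, not figures from the NFER analysis:

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def two_proportion_power(p1: float, p2: float, n1: int, n2: int) -> float:
    """Approximate power of a two-sided, 5%-level two-proportion z-test
    (normal approximation with an unpooled standard error)."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = 1.96  # critical value for a 5% two-sided test
    return normal_cdf(abs(p1 - p2) / se - z_crit)

# Hypothetical scenario: 70 participating vs 32 non-participating schools,
# with a sizeable 15 percentage-point difference in some school characteristic.
power = two_proportion_power(p1=0.50, p2=0.35, n1=70, n2=32)
print(f"Power to detect a 15 percentage-point difference: {power:.0%}")
```

Under these assumptions, even a sizeable difference between participating and non-participating schools would be detected less than a third of the time, so a non-significant result says very little about the absence of bias.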
13 The full bias analysis is available from https://www.whatdotheyknow.com/request/pisa_2018_data?nocache=incoming-1717380#incoming-1717380

It hence seems that, for Northern Ireland, an absence of evidence is being used to claim that there is an absence of bias. The problem is that the investigations of potential bias have been extremely limited, with an almost impossibly high bar set. Again, as with the bias analysis conducted for England, the actual results of the analysis are open to interpretation, with different individuals likely to form different opinions based upon the evidence available. Yet, as I argued above in the case of England, it is critical that such evidence is clearly and transparently reported, so that independent judgements can be formed. Relegating such information to a couple of nondescript sentences in the appendix, simply saying that the results are "positive" and that the sample is unequivocally "representative", should not be considered acceptable.
Note that an additional issue with the Northern Ireland data is that no information is published about the number/proportion of school exclusions, within-school exclusions or ineligible pupils. It is therefore not possible to produce comparable figures to those for Scotland and the UK presented in Tables 2-4. Overall, then, it is difficult to estimate the cumulative impact that the various forms of non-participation have had upon Northern Ireland's PISA results.

Conclusions
PISA is a widely watched study of 15-year-olds' skills in reading, science and mathematics. Since its inception in 2000, it has had a major impact upon governments and education policy, driving changes to schooling systems across the globe. In the United Kingdom, PISA has become the main resource to compare inputs and outcomes across its four devolved nations, representing the only UK-wide assessment taken by a sample of pupils on a regular basis. The triennial PISA results have hence become a high-profile issue in each of England, Northern Ireland, Scotland and Wales.
Unfortunately, PISA has a rather chequered history in the UK. Specifically, after the UK was excluded from the results of the 2003 edition due to concerns over low response rates and data quality, the validity of the PISA study was brought into question (Jerrim 2013). Many assume that such issues are now in the past, given how the data for the UK have always been deemed to be of acceptable quality in all subsequent PISA cycles. Yet, in reality, the situation is much more complex than first meets the eye. There remain many ways for countries to not test pupils who are technically part of the target population, with lower achievers disproportionately likely to be removed from the sample. The aim of this paper has been to explain how such issues arise, based upon a case study of the PISA 2018 data for the UK. In doing so, it is hoped that the paper helps to broaden understanding of these technical, but important, points amongst a wider audience.
The paper illustrates how the UK, and, by implication, its four constituent nations, has some of the lowest overall participation rates of any country. Importantly, this non-participation seems to be mostly driven by selection of pupils out of the sample (e.g. pupils not turning up on the day of the PISA test) rather than by schools. This occurs through various channels, including schools excluding certain pupils from the test, pupils being classified as ineligible due to school moves, and non-response. Moreover, for some parts of the UK, there is clear evidence that this has a non-trivial impact upon the representivity of the PISA data, leading to a sizeable upward bias in average scores. For instance, I estimate that average PISA mathematics scores in Wales would likely be around 15 points lower if a truly representative sample of pupils had taken the test.
The paper has also raised some issues surrounding transparency of reporting the PISA results.
In Scotland, important changes were made to PISA in 2018, such as changing when the test was taken. Yet this change, and its implications, have not to my knowledge been documented by either the Scottish government or the OECD. Likewise, other clear anomalies with the Scottish data (e.g. the very high number of "ineligible" pupils) have not been explained or discussed. In England, a non-response bias analysis was produced, but not published (it was only obtained by the author via a freedom of information request). A similar non-response bias analysis was produced for Northern Ireland, yet with even less clarity about what exactly was produced and how this evidence was judged (again, this information was only obtained via a freedom of information request). Finally, in Wales and Northern Ireland, key pieces of information are not reported as a matter of course, such as the number of within-school exclusions and ineligibility rates. This means that we do not currently have any handle on the overall non-participation rates in PISA in these parts of the UK. These are all basic facts about the data that have not been transparently reported, clearly thought through or discussed.
There are of course some limitations of this work. First, although I have illustrated how non-participation is high across the UK, and that this clearly leads to bias in the data for at least some of the constituent nations, it has not been possible to investigate its source. For instance, it is not possible with the data available to establish whether it is school non-response, pupil non-response, within-school exclusions or pupil mobility that is driving the bias clearly observed in the PISA data for Wales. Further data, tracking each pupil via administrative records through each stage of the selection process outlined in Figure 2, would be needed to provide further insight into this issue. Second, the focus of this paper has been PISA 2018, and not how these issues may have affected previous PISA rounds. As this paper has illustrated, gaining access to and understanding all the nuances for even a single round of PISA is challenging. This task then gets multiplied if one attempts to consider multiple PISA sweeps. Yet building up a clearer picture on this matter is important to help academics and policymakers judge how reliably PISA can inform about changes over time.
Finally, the paper has presented a case study for the UK. Such issues may of course impact upon other countries as well, particularly those with low overall participation rates that accompany the UK towards the bottom of the international distribution.

Despite these limitations, the work has clear implications for policy and practice. The most pressing issue is for the UK Statistics Authority to conduct an independent review of the UK's PISA data. This should include a focus upon the transparency of reporting and documentation of key issues, providing some "best practice" guidelines for each of the four UK governments to follow in the reporting of future PISA rounds. To help facilitate this, Wales and Northern Ireland should follow Scotland's lead and apply to become "adjudicated sub-regions" in PISA.
Although this paper has illustrated how this is no panacea for all potential problems, it would ensure that some key information about the Welsh and Northern Irish data gets reported by the OECD, and that the PISA data for these countries are held to the same technical standards as Scotland's and (essentially) England's. In addition, as each of the four UK nations conducts national examinations not long before/after PISA, they each have access to high-quality data to investigate and document potential bias in the sample (similar to the comparisons I have presented in Table 5). The UK thus actually has very good data to thoroughly investigate the issues discussed throughout this paper, but currently does not do so.
Yet such analyses are informative, quick and simple to conduct, and should be reported for future rounds of PISA as a matter of course. Finally, at an international level, the OECD needs to reconsider its technical standards, the strictness with which they are applied, and its data adjudication processes. The evidence presented in this paper illustrates how the processes currently in place flatter to deceive, and are nowhere near robust enough to support the OECD's claims that PISA provides truly representative and cross-nationally comparable data.

Notes to figures and tables: The figure for Croatia is rounded down from 107% to 100% (likely due to inaccuracies in data on total population size). * indicates the country had to conduct a non-response bias analysis. + indicates other issues with the data "adjudicated" by the OECD (usually due to thresholds stipulated within the technical standards not being met). For Wales, data on the PISA grade distribution are based on a freedom of information request submitted by the author (https://www.whatdotheyknow.com/request/pisa_2018_data_2). Data on the "official" grade distribution are taken from https://statswales.gov.wales/Catalogue/Education-and-Skills/Schools-and-Teachers/Examinations-and-Assessments/Key-Stage-4/gcseentriesandresultspupilsaged15only-by-subjectgroup, using data for the 2018/19 academic year. Average PISA scores by grade are based upon Gambhir, Dirie and Sizmur (2020: Table 3.3) and refer to the best GCSE grade out of numeracy and mathematics.

Figure 2. School response rates before and after replacement in PISA 2018
Notes: The horizontal axis refers to the "before replacement" school response rate: the percent of initially sampled schools that completed the PISA test. Figures on the vertical axis refer to the "after replacement" response rate: the percent of schools that completed the PISA test after substitute schools are included in the figures. The dark-blue area, where the "before replacement" level is below 65%, means the technical standard has not been met and the country should be excluded from the PISA study. The light-blue "acceptable" area is where countries are fully compliant with the PISA school response rate technical standard. The "intermediate zone" in the middle refers to where the OECD technical standard has not been fully met, with countries required to complete a school-level non-response bias analysis.

Appendix B

The OECD (2019: Table 11.8) shows how, of the 3,687 pupils used in the official response rate calculation, 718 pupils were counted as absent on the day of the test. As Appendix Table B1 illustrates, this is consistent with the figures for non-consent/absence provided by the Scottish government in response to my freedom of information request (formed of 122 pupils whose parents did not consent and 596 pupils who were absent on the day of the test).
Unfortunately, this leads to a potentially important discrepancy in the figures reported by the Scottish government. If one takes the number of "eligible participants" reported by the Scottish government (3,767) and subtracts the number of parental non-consents (122) and absences (596), this gives 3,049 pupils who should appear as participants. Yet only 2,969 pupils are recorded as having taken part. There are hence 80 (3,049 - 2,969) pupils that have gone missing from the Scottish sample and have not been accounted for. Yet, as Appendix Table B1 illustrates, this determines whether Scotland falls just above or just below the 80% pupil response rate threshold.
What has happened? It appears that this discrepancy of 80 pupils is due to two individual schools having a particularly low response rate. This has led to the OECD excluding pupils within these schools from the calculation of Scotland's official pupil response rate, and including their two schools in the school non-response figures instead 14 .
As noted by Chapter 4 of the PISA 2018 technical report (OECD 2019): "A school with a student participation rate between 25% and 50% was not considered as a participating school for the purposes of calculating and documenting response rates… However, data from such schools were included in the database and contributed to the estimates included in the initial PISA international report." In other words, individual schools with low levels of pupil response do not form part of the pupil response rate calculation; rather, their schools are moved to the school non-response figures instead. This is despite the data from such schools being included in the PISA database and contributing to a country's results. Hence the decision to include these pupils in the school non-response figures, rather than the pupil non-response calculation, is somewhat perplexing, given that "selection" out of the study is being driven at the pupil level (i.e. their school has attempted to conduct PISA, but insufficient numbers of its pupils have agreed to take part).
How then does this issue play out in the data for Scotland?
If one downloads the international PISA database (https://www.oecd.org/pisa/data/2018database/) and looks at the data for Scotland, one sees that the number of pupils is 2,998 (from across 110 schools). This is 29 more pupils (and two more schools) than the 2,969 pupils (from across 108 schools) that have been used in the OECD's calculation of the official pupil response rate (see Appendix Table B1). These are presumably the 29 pupils (out of the 80 sampled) who took the PISA test in the two schools with low pupil response rates. Indeed, this would imply that across these two schools the pupil response rate was 36%, between the 25% and 50% thresholds required for this unusual situation to come about.
If these 80 observations are included in the pupil response rate calculation, which I believe is more appropriate than treating their schools as non-respondents, then the numerator in the calculation becomes 2,998 (matching the number of observations in the final PISA database) while the denominator becomes 3,767. This would lead to the response rate for Scotland being 2,998 / 3,767 = 79.6%, falling below the 80% threshold 15 .
Thus, in essence, Scotland has only managed to exceed the 80% threshold, using the OECD's calculation, because two outliers (i.e. two schools with particularly low pupil response rates) have been removed from the pupil response rate calculation.

15 Although this is an unweighted figure, the weighting would seem to make a trivial difference here. In OECD (2019: Table 11.8) the difference between Scotland's weighted and unweighted pupil response rate is tiny: 0.02%.

A final aside
The oddness of this approach to calculating pupil response rates can be illustrated in two ways.
First, the minimum pupil response rate a country can theoretically achieve is 50% (not 0% as one might assume). This is because any school with a pupil response rate below 50% gets moved into the school non-response calculations instead.
Second, it is possible for the number of pupils tested within sampled schools to increase, but for the "official" pupil response rate to decrease.
To understand why, recall how the official calculation of the pupil response rate in Scotland is:

2,969 / 3,687 = 80.5%

These figures were used because the OECD decides to exclude from the calculation any school where the number of pupils tested falls between 25% and 50% of those sampled. As noted above, if these pupils are included in the pupil non-response figures instead, then the pupil response rate becomes:

2,998 / 3,767 = 79.6%

Now suppose that we managed to increase the response rate within these two schools up to 50% (i.e. we managed to successfully test 40 of the 80 pupils across these two schools, rather than 29). This would increase the numerator in the "official" pupil response rate calculation up to 2,969 + 40 = 3,009. Yet the denominator used in the official calculation would also increase, up to 3,767. This would then give an official pupil response rate of:

3,009 / 3,767 = 79.9%

In other words, we have managed to test more of our sampled pupils, yet the official pupil response rate would go down from 80.5% (just above the 80% threshold) to 79.9% (just below the 80% threshold). Thus Scotland could have actually got more of the sampled pupils to take part in PISA, but seen its pupil response rate fall (with its school response rate increasing instead).
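This paradox can be checked numerically. The function below is my own sketch of the OECD rule as described above, not official code:

```python
def official_pupil_response_rate(tested_elsewhere: int, eligible_elsewhere: int,
                                 tested_low: int, sampled_low: int) -> float:
    """Unweighted pupil response rate (as a percentage) under the OECD rule:
    schools where 25-50% of sampled pupils were tested (exclusive of 50%)
    are moved to the school non-response figures and dropped entirely from
    the pupil-level calculation."""
    share = tested_low / sampled_low
    if 0.25 <= share < 0.50:
        # Low-response schools excluded from the pupil-level calculation.
        return 100 * tested_elsewhere / eligible_elsewhere
    # Otherwise the schools count as participating and re-enter both
    # the numerator and the denominator.
    return 100 * (tested_elsewhere + tested_low) / (eligible_elsewhere + sampled_low)

# Scotland's actual situation: 29 of the 80 pupils tested in the two schools (36%).
actual = official_pupil_response_rate(2_969, 3_687, 29, 80)
# Counterfactual: 40 of 80 tested (exactly 50%), so the two schools now count.
counterfactual = official_pupil_response_rate(2_969, 3_687, 40, 80)
print(f"Actual: {actual:.1f}%  Counterfactual: {counterfactual:.1f}%")
# Actual: 80.5%  Counterfactual: 79.9%
```

The discontinuity at the 50% boundary is what drives the result: as soon as the two schools cross it, all 80 of their sampled pupils re-enter the denominator at once, outweighing the extra pupils tested.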
The analysis presented in Tables 5b and 7b of the main text has been used by the NFER, OECD and national governments to justify their view that the non-response bias analysis shows the samples to be "positive" and "representative". Yet the comparisons upon which they focus are likely to give an overly optimistic picture of the ability of replacement schools and weighting to reduce bias in the statistic of interest (PISA scores).
To understand why, I reproduce Table 5b below for England. Column (1) provides a binary measure of historic school achievement for all of the originally sampled schools (i.e. the 199 schools that were meant to take part). Column (2) provides the analogous figures for the 144 of these schools that actually did take part. This comparison reiterates the point made in the main text: lower-achieving schools were more likely to refuse to take part in the study (they made up only 28% of the participating sample, compared to 33% of the full original sample).
PISA has two ways of trying to compensate for this problem. The first is the use of "replacement schools": for each school that refused to participate, a substitute can take its place. This is a form of imputation, and is subject to a Missing At Random (MAR) assumption.
Critically, these substitute schools belong to the same stratum as the school that refused to participate. Hence a refusing school in the bottom 40% of the attainment distribution is replaced by another school in the bottom 40% of the attainment distribution, an entirely sensible thing to do. This does, however, mean that the figure for the "bottom 40% of the attainment distribution" can essentially only go up between column (2) (before replacements are included) and column (3) (after replacements are included); it has been forced to do so. The extent to which this would in turn also force upwards the unobserved quantity of primary interest (PISA scores) is open to debate; unless school-level PISA scores and school-level historic GCSE performance are perfectly correlated, it is likely to provide an overly optimistic picture.
The same logic applies once the weights are introduced in column (4). The sample after replacements are included (column 3) still underrepresents lower-achieving schools. The non-response adjustments made in the PISA weights will recognise this, and thus ensure that schools with lower historic GCSE performance are "worth" more in the analysis. Again, this is an entirely sensible thing to do. It does, however, mean that there is a mechanical increase in the percentage of schools with low historic GCSE grades between columns (3) and (4); the extent to which this will help to reduce the potential bias in the unobserved quantity of interest (PISA scores) is open to debate.
It thus follows that comparisons of columns (1) and (3), and of columns (1) and (4), in Appendix Table C1 provide an overly optimistic perspective on how well the non-response adjustments (replacement schools and weights) have "worked" in reducing the likely bias in the quantity of interest (PISA scores). In my view, such an analysis can, at best, only provide evidence of whether there are very serious concerns (e.g. one would be extremely worried if, even with the use of replacement schools and weighting, the figures in columns (1) and (4) continued to materially differ).
It is also important to reiterate a further point made in the main text of the paper: only a very limited number of school-level variables have been used to investigate potential bias in both England and Northern Ireland. The evidence on potential bias, and on whether the data are "representative", that has been presented by the NFER, the English and Northern Irish governments and the OECD is therefore very limited. In my view, it does not provide particularly strong evidence as to whether school non-response has led to bias in the sample or not. More detailed analyses of the data would have been possible at the time the non-response bias analysis was produced, at least in the case of England (e.g. a comparison of Key Stage 2 scores at the pupil level), but do not seem to have been conducted.
Thus, as noted in the main text of the paper, it is open to interpretation whether one has much confidence in the above as evidence of the samples being "representative" or not. What is, in my view, unforgivable is that the NFER, the OECD and the national governments have not clearly and transparently presented the evidence that would allow independent individuals to judge the strength of the evidence for themselves (or even explained how this judgement was reached). Rather, they have chosen simply to say that the results are "positive" and that the data are "representative" when, in reality, this is at best only a partial reflection of the evidence available.