Risk of bias and open science practices in systematic reviews of educational effectiveness: A meta-review

In order to produce the most reliable syntheses of the effectiveness of educational interventions, systematic reviews need to adhere to rigorous methodological standards. This meta-review investigated risk of bias arising while conducting a systematic review, as well as the presence of open science practices such as data sharing and reproducibility of the review procedure, in recently published reviews in education. We included all systematic reviews of educational interventions, instructions and methods for all K-12 student populations in any school form, with experimental or quasi-experimental designs (i.e., an active manipulation of the intervention) that included comparisons, and where the outcome variables were academic performance of any kind. We searched the database Education Resources Information Center (ERIC) for the years 2019–2021. In parallel, we hand-searched four major educational review journals for systematic reviews: Educational Research Review (Elsevier), Educational Review (Taylor & Francis), Review of Education (Wiley) and Review of Educational Research (AERA). Altogether, 88 systematic reviews matched our criteria; of these, 10 educational systematic reviews were judged as low risk of bias (approximately 11%). The rest were classified as high risk of bias during a shortened ROBIS assessment, or assessed as high or unclear risk of bias following a full ROBIS assessment. Of the 10 low risk of bias reviews, six had detailed their search in sufficient detail for a third party to reproduce it, and three shared the data from the primary studies; however, none had specified how and from where exactly the data from primary studies were extracted. The study shows that at least a small part of the systematic reviews in education has a low risk of bias, but most systematic reviews in our set of studies have a high risk of bias in their methodological procedure. Improvements in this field are still to be expected, as even the low risk of bias reviews are not consistent regarding pre-registered protocols, data sharing, reproducibility of primary research data and reproducible search strings.


INTRODUCTION
As education research grows, so does the need to synthesise the plethora of claims regarding the effectiveness of a vast array of instructional methods and interventions, such as early reading instruction, support for struggling students or the design of educational interventions.

BACKGROUND
The history of synthesising research findings in education (a concept used before the term 'systematic review' became popular) is long, perhaps preceded only by the meta-analyses of Gene V. Glass in 1976, with the first such synthesis in education published as early as 1980 (Ahn et al., 2012). Since then, governmental agencies and non-governmental organisations focusing on education have been growing in number, in order to help stakeholders, schools and the public be better informed about how best to organise schools and support teachers with the best available evidence. Although attempts to synthesise research findings have a long history, more rapid statistical development came during the 1980s to improve statistical precision (i.e., meta-analyses), followed by concerns about methodological quality and biases in primary studies (Schünemann et al., 2013; Whiting et al., 2016).
Of particular interest for establishing research evidence are Cochrane and its sibling organisation, the Campbell Collaboration. Whereas Cochrane establishes standards for conducting systematic reviews, especially in medicine and healthcare, the Campbell Collaboration focuses more on social interventions, including education research, and can be referred to as the leading organisation for systematic reviews in education. Together, and with additional assistance from other organisations, such as the What Works Clearinghouses in the USA and the UK, they set the standards for how to perform systematic reviews of the highest quality, in order to rigorously and reliably synthesise evidence. The question, then, is whether researchers and journals adhere to these standards and whether the standards have impacted the research community in education.
First, we present standards for systematic reviews and standards for assessing risk of bias. Second, we present the larger overarching review project and the current sub-report within that project, as well as details of the study's aim and scope, the PICOS that informs the eligibility criteria for including systematic reviews in this particular sub-report, and our research questions.

STANDARDS IN SYSTEMATIC REVIEWS
Pioneer organisations, such as Cochrane, developed standards for conducting high-quality systematic reviews and other review formats (Chalmers et al., 2002), such as scoping reviews, rapid reviews, evidence maps and living reviews. Typically, a Cochrane review (i.e., a systematic review sanctioned by the organisation) follows MECIR, the Methodological Expectations of Cochrane Intervention Reviews (Higgins et al., 2022), a comprehensive guideline covering all crucial aspects of the review, including the entire workflow and the reporting. MECIR is also closely connected to a handbook, the Cochrane Handbook for Systematic Reviews of Interventions (Higgins et al., 2019), and to other useful programs and tools, such as RevMan and the PRISMA flow diagram for documenting the search and inclusion of studies. A widely disseminated generic guide for authors is PRISMA, the Preferred Reporting Items for Systematic Reviews and Meta-analyses (Liberati et al., 2009; Page et al., 2021), which some journals require to be followed or consulted. PRISMA is considered the minimum set of reporting items needed for a systematic review and includes guidance for reporting the methods used to identify, select, appraise and synthesise studies.
A systematic review is a method designed 'to collate all the empirical evidence that fits pre-specified eligibility criteria in order to answer a specific research question. It uses explicit, systematic methods that are selected with a view to minimizing bias, thus providing more reliable findings from which conclusions can be drawn and decisions made' (Lasserson et al., 2021, section 1-1). A systematic review stands out among review formats in its emphasis on asking specific questions, in comparison with, for example, scoping reviews, which instead are broad in scope. A specific question entails research questions such as the effect of an intervention or a method. The method also requires that the search identify all relevant studies matching the eligibility criteria, that steps be taken to assess quality or bias in the included studies, both within studies (e.g., limitations in method) and between studies (e.g., publication bias), and that workflows (e.g., searches and synthesis) be reproducible. Similarly, the Campbell Collaboration uses an adapted form called MECCIR, the Methodological Expectations of Campbell Collaboration Intervention Reviews (Wang et al., 2021). Both organisations guide review teams to apply 'best practice' throughout the review. Therefore, rigorous systematic reviews are the preferred choice for gathering evidence that can inform researchers and stakeholders about the effectiveness of different interventions.
Although systematic reviews with meta-analyses are considered the best method for synthesising research evidence (Schünemann et al., 2013), they are constantly criticised for providing a skewed estimate of the literature because of the many ways they can be misused (Sotiriadis et al., 2020). Misuses include, for example, not setting up a pre-registered detailed protocol for the planned review, combining studies with different populations, not conducting exhaustive searches, ambiguous selection of studies, combining studies of different designs or outcome measures, not performing sensitivity or robustness analyses, not acknowledging publication bias, and including studies of low quality (Polanin et al., 2020; Sotiriadis et al., 2020). The standards set, for example, by the Campbell Collaboration are an approach to combat all these methodological limitations.

RISK OF BIAS
As primary studies comprise the data in systematic reviews, review teams should appraise the quality of the included studies. A widely used tool for assessing bias, or quality, of included primary studies in systematic reviews is the revised risk-of-bias tool for randomised trials, RoB 2 (Sterne et al., 2019). This tool assesses bias arising in different domains and steps of an intervention or trial study, and helps review authors judge how bias, defined as 'a systematic deviation from the effect of intervention that would be observed in a large randomised trial without any flaws' (Sterne et al., 2019, p. 1), affects the estimate of the intervention effect. For each of the five domains, review authors answer signalling questions, such as 'Was the allocation sequence random?' and 'Were there deviations from the intended intervention that arose because of the trial context?' Each domain is then given an assessment of 'low risk of bias', 'some concerns' or 'high risk of bias', which ultimately leads to an overall risk of bias judgement. This judgement can then be used in various ways, for example to adjust the weight of an individual study in a meta-analysis, or to inform a decision to remove an individual study from the synthesis if it is assessed as high risk of bias, because studies with a risk of bias can inflate effect sizes or even jeopardise overall conclusions regarding the effectiveness of an intervention (Sotiriadis et al., 2020). Equivalent tools for assessing study quality, covering designs other than RCTs, also exist, such as the Newcastle-Ottawa Scale (Wells et al., 2000).
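To make the reduction from domain judgements to an overall judgement concrete, here is a minimal sketch in Python. The domain names follow RoB 2 (Sterne et al., 2019); the rule that any 'high' domain makes the study high risk and that all 'low' domains make it low risk follows the published guidance in outline, but the exact threshold at which several 'some concerns' escalate to 'high' is our assumption, not part of the tool.

```python
from typing import Dict

# The five RoB 2 domains (Sterne et al., 2019).
DOMAINS = [
    "randomisation process",
    "deviations from intended interventions",
    "missing outcome data",
    "measurement of the outcome",
    "selection of the reported result",
]

def overall_rob2(judgements: Dict[str, str], escalate_at: int = 3) -> str:
    """Collapse domain judgements ('low', 'some concerns', 'high') into an
    overall judgement. How many 'some concerns' should escalate to 'high'
    is a review-team judgement (assumed here: 3)."""
    levels = [judgements[d] for d in DOMAINS]
    if any(level == "high" for level in levels):
        return "high"
    concerns = sum(level == "some concerns" for level in levels)
    if concerns == 0:
        return "low"
    return "high" if concerns >= escalate_at else "some concerns"

example = dict.fromkeys(DOMAINS, "low")
example["missing outcome data"] = "some concerns"
print(overall_rob2(example))  # -> "some concerns"
```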
Just as risks of bias can arise in primary studies, systematic reviews are not immune to biases, for example selection bias. Assessing bias in primary studies is considered crucial in systematic reviews and is also a signalling question in ROBIS, the Risk of Bias in Systematic Reviews tool (Whiting et al., 2015, 2016), which assesses risk of bias in systematic reviews themselves. At the time ROBIS was developed, few similar tools existed that specifically addressed the risk of bias of systematic reviews (Whiting et al., 2016). ROBIS was developed in close connection to Cochrane's MECIR handbook (and, by extension, to MECCIR, which is used for educational research at the Campbell Collaboration). Following an extensive four-stage developmental process with review experts using a Delphi approach, content domains and signalling questions were derived from 46 items identified as bias items in the MECIR handbook. The focus is on the practical process of conducting a review, namely bias arising from how well the research question and the eligibility criteria overlap, the searches, the review process, the synthesis and the conclusions (Whiting et al., 2016). The developers agree that some form of formal training in systematic reviews and meta-analyses is required; however, ROBIS assessments have been shown to be reliable across raters, even for inexperienced users, with good construct validity (Bühn et al., 2017).
For the purpose of this study, we chose ROBIS to assess risk of bias. We chose it because ROBIS is closely connected to MECIR, which guides how to conduct high-quality reviews. In addition, the generic nature of the tool suits our purpose: it can be used for any type of intervention design included in a systematic review (i.e., RCTs and quasi-experimental designs, including single-subject designs), it provides clear guidance on how to properly assess studies, and it covers the entire review process. The main alternative tool is AMSTAR-2, whose signalling questions largely overlap with those of ROBIS and which has been shown to produce broadly similar end-judgements (Banzi et al., 2018).
ROBIS consists of three phases: (1) relevance, (2) concerns with the review process and (3) a judgement of risk of bias. When reviewing the review literature, the first phase is about identifying reviews that adhere to the scope of the project. In Phase 1, the target question of the review project should be equal to that of the review being screened (e.g., 'Do phonic instructions affect word reading outcomes?'). Target questions are derived from the project's predefined PICO (Population, Intervention, Comparator and Outcome), and the authors need to assess the match of the included studies in the review being scrutinised. For example, if a project aims to gather evidence from instructional intervention studies with an RCT design, the review in question needs to include such studies meeting the defined PICO. A well-defined PICO (or PICOS, where S stands for study type or setting) is crucial for the relevance check, as outlined in ROBIS Phase 1.
Phase 2 consists of four domains, the first being 'Eligibility criteria', which directly connects to Phase 1 (specifically for matching the PICOS to relevant studies). The ROBIS tool then lets reviewers answer signalling questions under the following rubrics: 'Identification and selection of studies', 'Data collection and study appraisal' and 'Synthesis and findings'. Based on the assessment and the answers to the signalling questions, an overall judgement is made in Phase 3, 'Judging risk of bias', which results in a review being judged as having either low, high or unclear risk of bias. All ROBIS phases and questions are supported by a guidance document that helps reviewers provide an accurate and reasonable estimate, although the strength of the assessment lies in the systematicity of assessing the risk of bias, which can be used directly to differentiate the evidence level of different systematic reviews.
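As a rough sketch of the structure (not the reasoning) of a ROBIS assessment, the snippet below records a level of concern per Phase 2 domain and derives a Phase 3 judgement. The domain names follow the tool; the mechanical reduction rule is our simplification, since Phase 3 in the actual tool is a reasoned overall judgement that also weighs how concerns were addressed.

```python
from dataclasses import dataclass, field
from typing import Dict

# The four Phase 2 domains of ROBIS (Whiting et al., 2016).
PHASE2_DOMAINS = [
    "Study eligibility criteria",
    "Identification and selection of studies",
    "Data collection and study appraisal",
    "Synthesis and findings",
]

@dataclass
class RobisAssessment:
    review_id: str
    # Per-domain level of concern: "low", "high" or "unclear".
    concerns: Dict[str, str] = field(default_factory=dict)

    def phase3_judgement(self) -> str:
        """Simplified stand-in for the Phase 3 judgement: low only when
        every domain raises low concern; high when any domain raises
        high concern; unclear otherwise."""
        levels = [self.concerns.get(d, "unclear") for d in PHASE2_DOMAINS]
        if all(level == "low" for level in levels):
            return "low"
        if any(level == "high" for level in levels):
            return "high"
        return "unclear"

review = RobisAssessment("Example2020", {d: "low" for d in PHASE2_DOMAINS})
print(review.phase3_judgement())  # -> "low"
```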

THE CURRENT STUDY
This study is part of a larger review project, which means that the searches and assessment of systematic reviews in education will be updated once a sufficiently large corpus of new systematic reviews in education has been published (see Elliott et al., 2017). The project aims to gather the best available evidence in education (the preregistered protocol of the larger project can be found at https://doi.org/10.17605/OSF.IO/34AEH). This is done by identifying high-quality systematic reviews that are assessed as having a low risk of bias (i.e., the current study) and where data are available in line with the FAIR principles (Findable, Accessible, Interoperable and Reusable; Wilkinson et al., 2016). Shared data can be used for publishing so-called CAMAs, Community Augmented Meta-Analyses (Burgard et al., 2021; Tsuji et al., 2014; see, for example, MetaLab, https://langcog.github.io/metalab/). A CAMA is an open dataset that researchers can interact with, and it enables keeping meta-analyses updated as soon as new eligible studies are published. Furthermore, open science practices such as transparency and the sharing of data and code should not be considered just a trendy feature. Systematic reviews that aim to fully review a research area need to be transparent themselves throughout the entire review process (Polanin et al., 2020), in order to reduce bias that could arise in the different steps of that process.
In this study, we analyse systematic reviews in education published between 2019 and 2021, identified through the database ERIC as well as by hand-searching four major review journals in education: Educational Research Review (Elsevier), Educational Review (Taylor & Francis), Review of Education (Wiley) and Review of Educational Research (AERA). Although this search strategy is not exhaustive enough to find all systematic reviews of educational interventions, it still allows us to retrieve studies explicitly labelled as systematic reviews published in major scientific journals. We do not treat our final set of reviews as a representative sample of all published systematic reviews and focus only on the characteristics of the reviews we find; thus, we do not aim to generalise beyond this set of studies, similar to the meta-review by Polanin et al. (2020), who focused only on a set of 150 studies from the journal Psychological Bulletin. We also reasoned that searching further back than 2019 would be redundant at best: we are mainly interested in whether the highest standards have become the standard approach, and we believe recent publications will show whether that is the case.
The knowledge contributed by this study can be used by all stakeholders and others interested in the assessed risk of bias of synthesised effects of educational interventions and methods, and in which areas of the reviews have low or high risk of bias.

THE FOCUS OF OUR STUDY
In this project, the PICOS is defined as follows. Population: all K-12 student populations. Intervention: all types of educational/instructional methods and interventions. Comparator: the study should include a comparison group (of any type). Outcome: the outcome variables should directly involve student academic performance. Study type and Setting: designs should be RCTs or quasi-experimental designs (i.e., an active manipulation of the intervention), set in any school form, including special schools, clinics, home teaching and institutions (e.g., juvenile detention). Note that this stringently operationalised PICOS was chosen to include only properly designed evaluations of effects (McKenzie et al., 2019), guided by Campbell's MECCIR handbook on how to define and set up an adequate and focused review (Wang et al., 2021). We are aware that there are other types of review questions (e.g., mapping the prevalence of reading difficulties in a school population) that will not be covered by this meta-review.
After the searches, we screen titles and abstracts against our PICOS (using exclusion criteria), followed by full-text screening using the relevance check in ROBIS and screening for risk of bias, first using selected items from ROBIS and, if a review passes this phase, conducting the full ROBIS assessment. In addition, because we are also interested in publishing CAMAs (Burgard et al., 2021), we perform a basic reproducibility check (i.e., whether reported search strings are reproducible and whether extracted data from the primary studies are available). For this sub-report, we only report these findings and do not actually extract any data from the reviews. Availability nevertheless signals quality, as shared material (including searches, coding of studies, extracted data, and code and analyses) is considered the norm in systematic reviews (Higgins et al., 2019; Polanin et al., 2020), as outlined by the FAIR principles (Wilkinson et al., 2016).
For the current study, we ask five research questions:

1. What systematic reviews of the effects of educational interventions/instructions/methods on student performance have been published from 2019 onwards?
2. Which of these studies summarise evidence from RCTs or quasi-experimental primary research?
3. Which of these studies have low risk of bias?
4. Which of these studies have available data extracted from primary studies that can be added to a Community Augmented Meta-Analysis platform?
5. Which of these studies have detailed their (a) database searches and (b) extraction of data with enough transparency that they could be reproduced by a third party?

METHOD
The preregistered protocol for this review can be found at https://doi.org/10.17605/OSF.IO/WXFDJ. We chose to use selected items from the PRISMA 2020 checklist for the reporting of this review. Deviations from the protocol are reported in the Method section.

Inclusion and exclusion criteria
As specified in the protocol, we included systematic reviews that focus on evaluating the effects of any type of K-12 educational intervention, instruction or method, irrespective of the type of implementer (e.g., teacher, research staff or self-instructing programs), and where the outcome variable is student academic performance of any kind. See Table 1 for details of the inclusion and exclusion criteria.

Information sources
We used the Education Resources Information Center (ERIC) database to identify systematic reviews published between 2019 and 2021. In parallel to the database search, we hand-searched the following four major education review journals for additional systematic reviews, mainly as an extra precaution to find a reasonable number of reviews rather than relying entirely on the ERIC search: Educational Research Review (Elsevier), Educational Review (Taylor & Francis), Review of Education (Wiley) and Review of Educational Research (AERA).

Search strategy
We searched the ERIC database through the Proquest interface provided by Linnaeus University, Sweden, using the basic search term 'systematic review' in abstracts, keywords, and titles.The rationale for searching only on the term 'systematic review' (and not, e.g., the term 'meta-analysis') was that we are only interested in studies that aim to follow the appropriate format explicitly designed for synthesising effects (that covers all crucial aspects of the review process).Authors not labelling their review, or meta-analysis, as a systematic review, will not likely adhere to the stringency of a systematic review.In particular, as the first statement of the PRISMA checklist is to label the report as a systematic review in their title and abstract.
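For illustration, a fully reproducible search record of the kind we looked for in reviews (see research question 5a) might be stored as below. The ti()/ab() field codes are ProQuest syntax; the specific query string, limits and field names are hypothetical, shown only to indicate the level of detail required.

```python
# Hypothetical example of a reproducible search record; the query and
# limits are illustrative, not our actual ERIC search.
search_record = {
    "database": "ERIC (via ProQuest, Linnaeus University)",
    "date_searched": "2021-07-02",
    "query": 'ti("systematic review") OR ab("systematic review")',
    "limits": "publication date 2021-01-01 to 2021-06-30",
    "records_retrieved": 0,  # record the exact count at search time
}
print(search_record["query"])
```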
Searches started with the period 2021-01-01 to 2021-06-30 (performed on 2 July 2021), followed by 2020 (performed on 3 August 2021) and 2019 (performed on 22 September 2021), and ended with 2021-07-01 to 2021-12-31 (performed on 2 February 2022). The references were transferred to the Zotero library. All searches can be found in the Supporting Information. As for the journals, we downloaded all articles published in the four journals between 2019 and 2021 into separate folders in Zotero, using the Zotero plug-in interface.

Selection process
After the searches, references were checked for duplicates and then transferred to the software Rayyan (Ouzzani et al., 2016), where title and abstract screening took place. Screening was conducted by comparing each review with our PICOS. We excluded studies when the title or abstract made us certain that their PICOS was not similar enough to warrant full-text reading.

TABLE 1 Inclusion and exclusion criteria.

Population. Inclusion: all types of students from all countries, from private, charter and public schools; subgroups are also included (i.e., struggling students, students with difficulties, students with disabilities, students with disadvantage). Exclusion: college students or adult populations.

Intervention. Inclusion: interventions carried out by teachers, special instructors, researchers, etc., as well as self-instructive programs. Exclusion: pharmaceutical treatments (e.g., for ADHD).

Comparison. Inclusion: all types of comparison groups: teaching/treatment as usual, active controls, another intervention, wait-lists, etc.; the comparison can also be within-participants (i.e., a student's previous performance is their own control). Exclusion: no exclusion criteria.

Outcomes. Inclusion: a wide range of student performance outcomes, such as academic skills (e.g., in reading, writing or math), subject knowledge (e.g., in science, social science or civics) or skills highly relevant in education (e.g., critical thinking, creative thinking or reasoning). Exclusion: outcomes that are exclusively psychological (e.g., self-efficacy, motivation, well-being or procrastination), behavioural (e.g., school attendance or altered behaviours towards the work environment) or general (e.g., environmental, climate or ethical issues) if not part of subject knowledge or the curriculum.

Study type. Inclusion: all experimental designs (randomised or quasi-experimental) that involve an intervention as well as a comparison group; this includes within-participant designs, single-subject designs, time-series, etc., where the student is their own comparison, as well as norm-references as comparison groups. Exclusion: correlational or observational designs.

Language. Exclusion: articles not written in English.
At the full-text reading stage, each study's PICOS was coded so that it could be compared with ours. At this stage, we realised that some studies had a broader PICOS than ours (e.g., included both K-12 students and adults), and we had not decided in advance how to deal with such scenarios. We therefore added three coding items (a deviation from the protocol), sketched below: first, whether the PICOS fully matched ours (Yes/No); if 'No', whether the study had a broader PICOS (Yes/No); and whether the PICOS could be separated to match ours, for example because the review reported separate analyses for K-12 (Yes/No).
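The three added items form a small decision cascade; a minimal sketch is given below, where the field names are ours, introduced only for illustration.

```python
# Sketch of the three PICOS coding items added at full-text reading
# (a deviation from protocol). Field names are hypothetical.
def code_picos_match(full_match: bool, broader: bool = False,
                     separable: bool = False) -> dict:
    record = {"picos_full_match": full_match}
    if not full_match:
        # Only coded when the review's PICOS did not fully match ours.
        record["picos_broader_than_ours"] = broader
        record["k12_analyses_separable"] = separable
    return record

# A review that pooled K-12 students and adults but reported separate
# K-12 analyses would be coded like this:
print(code_picos_match(False, broader=True, separable=True))
```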

Data collection process
The three authors independently extracted data from eligible reviews using Google Sheets. To gather details of the existence of protocols, searches and data, we searched each study's pages at its respective journal (e.g., supplementary material), as well as protocol repositories such as PROSPERO and the Open Science Framework registries when no details were provided on the journal's website.

Data items
The ROBIS items for the full assessment of systematic reviews can be found in Table 4, together with details of the existence of a pre-registered protocol (Yes/No), sufficient detail of the search for it to be reproduced (Yes/No) and the existence of the extracted primary research data (Yes/No) that formed the basis for the syntheses in the systematic reviews (e.g., M, SD and N for baseline and post-intervention for all groups). Additionally, details of whether the extracted data could be reproduced from the primary studies are presented (Yes/No). Systematic reviews from the full-text reading stage were coded using PICOS and can be found in Appendix S1.

Study risk of bias assessment
Systematic reviews included in the review were assessed for risk of bias using ROBIS with its guidelines (Whiting et al., 2016, 2023), conducted in two stages. In the first stage, two review authors each assessed half of the records using the shortened ROBIS, and each assessment was double-checked by a third reviewer. This stage only included items that were simple to use for excluding studies with a high risk of bias (see Table 2). A study that received a 'No' on one of the items was not coded further, as this indicates a high risk of bias. However, as we noted that it was rare for authors to have pre-registered a protocol independently of the review, we decided only to code whether a protocol existed, because excluding on this item would otherwise have resulted in very few included studies (a deviation from our protocol), albeit possibly rigorous systematic reviews in all other respects. We think it is safe to say that pre-registering a protocol is not yet the standard approach or tradition in education research.
The second stage was a full ROBIS check and was conducted by two independent coders. One reviewer (with long experience of education research) double-checked all coding in the full-text reading phase. The inter-rater reliability was 93% (including all items and all risk judgements). Disagreements were resolved by discussion.
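The 93% figure is simple percentage agreement across all coded items and judgements; a minimal sketch of the computation, with illustrative answer codes, is given below.

```python
# Percentage agreement between two raters over the same coded items.
# The answer codes below are illustrative only.
def percent_agreement(rater_a, rater_b):
    assert len(rater_a) == len(rater_b), "raters must code the same items"
    agreed = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100 * agreed / len(rater_a)

rater_1 = ["Y", "N", "Y", "PY", "N"]
rater_2 = ["Y", "N", "Y", "PN", "N"]
print(f"{percent_agreement(rater_1, rater_2):.0f}%")  # -> 80%
```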

Study selection
From a total of 2308 records screened after duplicates had been removed, 258 studies initially matched our PICOS and were read in full and coded. Of these, 88 reviews matched our PICOS and were subjected to the shortened ROBIS assessment. See Figure 1 for the PRISMA flow diagram of included studies.
During the full-text reading stage, it became clear that a large number of studies had not specified all crucial details of the review. While trying to specify the PICOS of the reviews, we discovered that, out of 258 studies, 118 (45%) reviews had no specification of the comparison groups used in the original studies, 59 (23%) lacked information on what types of studies should be included, 45 (17%) lacked details of the outcome, and 43 (17%) did not specify which population was studied.

Study characteristics
For more details on the populations, interventions, comparisons, outcomes and study designs each review focused on, see Appendix S1. Table 4 provides detailed information on the full ROBIS assessments, which are visualised in Figures 2 and 4.

RESULTS OF SYNTHESES
The first and second research questions asked simply which systematic reviews of educational interventions, instructions and methods on student performance had been published between 2019 and 2021, and which of these synthesised 'effects' from proper causal research designs (i.e., we only included RCTs or quasi-experimental primary research with comparisons). Our search revealed altogether 88 such systematic reviews. A further 170 reviews were judged not to match our PICOS following the full-text reading stage and were excluded.
It should be noted, however, that some of the reviews excluded at the full-text reading stage were close to our PICOS: some had the aim of providing an 'effect evaluation' (or similar) but did not specify sufficiently clear details, for example concerning the research questions, which population was studied, the methods used, or the specification of outcomes. Similarly, some authors did not separate their analyses in a way that would allow us to include them (e.g., they blended all kinds of designs or age groups together). Some reviews were mislabelled: often described as a systematic review but in fact a scoping review, a literature review or a combination of formats (i.e., they did not provide a proper 'effect' question and synthesis of data from primary studies), and none of these followed PRISMA closely enough.

TABLE 2 Items in the shortened ROBIS assessment.
1.1 Did the review adhere to predefined objectives and eligibility criteria?
1.3 Were eligibility criteria unambiguous?
2.1 Did the review search an appropriate range of databases/electronic sources for published and unpublished reports?
2.2 Were methods additional to database searching used to identify relevant reports?
3.1 Were efforts made to minimise error in data collection?
3.2 Were sufficient study characteristics available for both review authors and readers to be able to interpret the results?
3.4 Was risk of bias (or methodological quality) formally assessed using an appropriate tool?
3.5 Were efforts made to minimise error in risk of bias assessment?
We had not decided in advance how to approach these observations; they are discussed as exploratory findings in the Discussion section.
From the 88 reviews that matched our PICOS (or whose PICOS could be separated to match ours), 70 reviews received a 'No' on one of the shortened ROBIS items (and, as stated in the protocol, we stopped coding once a study had a 'No'). The most common items on which reviews did not pass were 2.1, meaning that the authors had not searched an appropriate range of databases/electronic sources for published and unpublished reports, and 3.4, meaning that the authors had not formally assessed the risk of bias or methodological quality of the primary studies using an appropriate tool. See Table 3 for the item assessment of the shortened ROBIS.
As for the assessment of risk of bias or methodological quality, we answered 'Yes' for reviews that used any named tool (regardless of how valid or reliable it was), whereas reviews that did not check quality or risk of bias at all were marked 'No'. This decision may be criticised as too liberal, but as these assessments are still rather unstandardised in education research, we thought it was a reasonable decision to make.
Having a 'No' on any of the items of the shortened ROBIS does not imply that the entire review is biased, but it signals that the outcome of the review might be biased. Note that this meta-review does not assess aspects of a review other than the areas inherent in ROBIS, for example the relevance of the intervention or ethical approaches. ROBIS assesses the risk of bias in the estimated effect size of an intervention (Whiting et al., 2016). For example, not conducting exhaustive searches means that studies are potentially missing, which would have affected the effect size estimate. Not assessing the risk of bias or study quality of the included studies means that studies of lower quality affect the effect size as much as studies of higher quality.
Eighteen systematic reviews passed the shortened ROBIS and were assessed fully using the ROBIS tool; see Figure 2.
The third research question was which systematic reviews had low risk of bias. Ten systematic reviews were judged as low risk of bias. We can therefore state that at least some systematic reviews evaluating effects of interventions or methods have a low risk of bias, as defined by ROBIS. These 10 reviews make up approximately 11% of the 88 reviews that matched our PICOS; see Figure 3. With a stricter use of ROBIS, potentially only five systematic reviews would pass as having a low risk of bias, because only half of them also had a pre-registered protocol (those that had a protocol matched well with the article, e.g., concerning research questions and the analysis plan). As stated before, we decided simply to note whether or not a review had a pre-registered protocol. Not registering a protocol in advance (or simply not having one to follow throughout the review work) flags a risk of bias in ROBIS, as there is no way for a third party to assess deviations in the workflow (e.g., concerning ambiguous inclusion/exclusion of articles or possible alterations of research questions).
However, according to the ROBIS guidelines, a study without an explicit mention of a protocol can still pass if the 'methods section appears rigorous and all analyses mentioned are addressed in the results' (Whiting et al., 2016, ROBIS guidance, p. 25). Still, with no mention of a protocol it is difficult to judge, although it seems that the 10 systematic reviews truly have a low risk of bias, as their methods sections are indeed rigorous.
The fourth and fifth research questions concerned the availability of data from primary studies (that can be reused, e.g., on a CAMA platform) and whether studies detailed their database searches and data extraction with enough transparency that they could be reproduced by a third party. This information is presented in Table 4. Three reviews judged to have a low risk of bias had primary data available in their supplementary material. The others either did not share data at all or only shared data that had been transformed in some way (e.g., into a standardised effect size instead of M and SD for each test occasion).
Research question 5b is similar but focuses on the reproducibility of the primary data. None of the three studies reported in enough detail how and from where the data were extracted (e.g., 'M, SD and N were extracted from Table X, page XX'). This means that the data from the primary studies are hard to reproduce. Therefore, we conclude that no studies in the present review passed the reproducibility check of synthesis data from the primary studies.
Research question 5a focuses on the reproducibility of the search. Six of the 10 low-risk reviews detailed their search so that it could be reproduced by a third party, for example by providing exact search strings and the filters used for each search engine.

Systematic reviews with unclear or high risk of bias
It should be stated that although eight systematic reviews were judged as unclear or high risk of bias, nearly all reviews passed the first three domains of ROBIS; see Figure 4. It seems that many authors who decide to conduct a proper systematic review do indeed care about these three areas: (1) proper handling of eligibility criteria, (2) identification and selection of studies, and (3) data collection and study appraisal. Only in exceptional cases did authors use methods that are deemed untrustworthy (e.g., searching for and only including studies that cite a seminal work) or omit information about, for example, eligibility criteria. Up to this stage in the ROBIS assessment, it is clear that authors know about and make efforts to follow PRISMA, although it should be pointed out that only the studies that actually passed the ROBIS checks adhered to PRISMA correctly.
In Domain 4, Synthesis and findings, another picture emerged, however. Because we included reviews regardless of a protocol, 11 of the 18 reviews were judged 'No information' or 'No' on the item 'Were all predefined analyses reported or departures explained?' Again, this means that we cannot know whether any analyses or sub-analyses were left out, or whether the authors tried several statistical models (that might have produced different results) before deciding on one. The reviews with no protocol are also not transparent in their narratives, as the articles do not contain details of planned analyses and deviations that the authors could have mentioned.
The eight systematic reviews judged as unclear or high risk of bias also had difficulties with the succeeding items in Domain 4. In four reviews, the synthesis was not appropriate (either no statistical synthesis or no meta-analytically weighted estimates); five of the reviews did not address between-study variation (often simply averaging point estimates from studies with unequal numbers of participants into an unweighted effect); and five reviews did not contain any robustness or sensitivity analyses of their findings. Four of the reviews also did not address biases (or did not simply exclude high-risk/low-quality studies from the syntheses), even though studies had been formally assessed at an earlier stage in the review process (see item 3.4, as all included studies passed the shortened ROBIS). Following this, none of the reviews judged as having a high risk of bias addressed the biases identified in Domain 4. One review might have, or probably had, addressed these concerns; however, as details were missing, it was judged to have an unclear risk of bias.
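To illustrate why Domain 4 flags unweighted averaging, the toy example below (with made-up numbers) contrasts a naive mean of point estimates with a fixed-effect inverse-variance weighted mean, which gives precise (typically larger) studies more influence on the pooled estimate.

```python
import math

# (effect size g, variance of g) for three hypothetical studies.
studies = [
    (0.80, 0.20),  # small study, large effect
    (0.15, 0.01),  # large study, small effect
    (0.30, 0.04),
]

naive_mean = sum(g for g, _ in studies) / len(studies)

weights = [1 / v for _, v in studies]          # inverse-variance weights
weighted_mean = sum(w * g for (g, _), w in zip(studies, weights)) / sum(weights)
standard_error = math.sqrt(1 / sum(weights))   # SE of the pooled estimate

print(f"unweighted mean: g = {naive_mean:.2f}")       # -> 0.42
print(f"weighted mean:   g = {weighted_mean:.2f} "
      f"(SE = {standard_error:.2f})")                 # -> 0.20 (SE = 0.09)
```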

DISCUSSION
This meta-review of the educational systematic review literature set out to investigate adherence to systematic review standards, as operationalised by the risk of bias tool ROBIS (Whiting et al., 2016), developed specifically for assessing the risk of bias in systematic reviews. Studies that closely follow the standards set by, for example, the Cochrane and Campbell Collaborations are assessed as having a low risk of bias, thus providing robust and reliable syntheses of evidence. ROBIS allows meta-reviewers to systematically assess the entire research process and method of a review, and ends by guiding a decision to judge a review as having low, high or unclear risk of bias. This can then reliably inform researchers and other stakeholders about the effects of a particular educational intervention. As these standards have been developed over many years, it is crucial to investigate how well published systematic reviews adhere to them. In addition, and in line with upcoming legislative changes in data sharing (e.g., the European Union has adopted the FAIR principles of data sharing), we investigated whether systematic reviews shared the data extracted from primary studies, and whether they detailed their database searches and data extraction with enough transparency that a third party could reproduce them.
Of the 88 systematic reviews that met the inclusion criteria, 18 passed the shortened ROBIS and were assessed on all items. Ten systematic reviews were judged to have a low risk of bias, approximately 11% of the reviews relevant to this study. These reviews focused on different interventions (e.g., algebra interventions, student-centred instruction, literacy instruction) for different student populations (e.g., K-12 students, students with disabilities and second-language learners). It is encouraging that there are reviews with a low risk of bias, but our findings show that 77 reviews were assessed as having a high risk of bias and one review an unclear risk of bias, and these do not seem to follow all the methodological standards developed for producing the most reliable evidence. Interestingly, of the 10 reviews judged as having a low risk of bias, only five had a pre-registered protocol independent of the review. We decided just to document, and not exclude, articles without a protocol. It is unclear why some systematic reviews, whose authors decided to adopt the rigorous methodological approaches of, for example, the Campbell Collaboration's MECCIR (Wang et al., 2021) or similar, do not have a pre-registered protocol published in an open repository. This is one area where we believe there is room for improvement.
Another area with room for improvement is data sharing and, in particular, reproducibility. Although adopting principles of open science and transparency, such as data sharing and reproducibility, is a more recent methodological development for primary researchers in education (Cook et al., 2022), the ideals are not new to systematic reviews. Only three of the systematic reviews judged to have a low risk of bias made the extracted primary research data available. This is somewhat surprising, as sharing data in reviews should be a natural step: the primary data are already coded and extracted, and they are not 'sensitive', as they have already been published in the primary studies. Sharing data makes it much easier for other researchers to build on and expand research findings for a wider audience, for example by publishing open datasets such as CAMAs or by making findings accessible in other ways. A few more reviews (six out of ten) did, however, provide enough detail to reproduce the searches, including the exact search strings and the engine filters used.
No systematic review detailed where exactly the primary data were extracted from. At first glance, this might seem an obsessive demand on review authors; however, when critically examining an area, it is crucial to be able to double-check the coding and data extraction. It is possible that journals only need to further emphasise the importance of these steps and provide the infrastructure for detailed data sharing and reproducibility for these ideals to become the standard approach, or at least more common. However, we believe it should be in all researchers' interest to produce reproducible extractions, not solely on open science grounds but also for back-tracking and data organisation. We recommend that researchers adopt this practice, as it rarely requires more than one additional column in the data extraction sheet.
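As a sketch of what that one extra column can look like, consider an extraction sheet in which every extracted statistic carries its exact source location. The column names and the example row below are hypothetical.

```python
import csv
import io

# One extracted row with a provenance column ("source") recording
# exactly where each statistic was taken from. Values are made up.
rows = [{
    "study": "Author2020",
    "group": "intervention",
    "n": 34,
    "mean_post": 12.4,
    "sd_post": 3.1,
    "source": "Table 2, p. 417 (post-test, intervention column)",
}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```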
To summarise, out of 88 systematic reviews in education, 31 did not search an appropriate range of databases for published and unpublished reports and were assessed as high risk of bias. Nearly half of the remaining reviews did not formally assess risk of bias in the original studies using an appropriate tool and were therefore assessed as high risk of bias. Ten systematic reviews were judged to have a low risk of bias, approximately 11% of the studies that matched our inclusion criteria. Of these, six detailed their search in sufficient detail for a third party to reproduce it, but only three shared the data from the primary studies, and none specified how and from where exactly the data from primary studies were extracted.

EXPLORATORY FINDINGS
In this review, we rely on the definition of a systematic review from pioneer organisations such as the Cochrane and Campbell Collaborations. During the different steps of screening and coding, it became clear that several reviews were mislabelled as systematic reviews. Most often these were instead typical scoping reviews or literature reviews (owing to the research questions asked and the (lack of) methods used to synthesise the effects), or a variant of a systematic review without the stringency that characterises one. We think this stands out in relation to the methods and research designs used in primary studies: there seems to be a more common understanding among researchers of what constitutes observational, quasi-experimental and experimental designs than of what constitutes the different review formats. Despite the influential work of the pioneer organisations, more work is needed to raise awareness of systematic reviews and of which review format best fits different types of research questions. Additionally, one of the most common problems when trying to identify the PICOS of each review was that many authors had not specified the most basic information, such as the population, comparison group or study type. This might have many causes, as there are examples of original studies that also do not specify this information clearly enough. There are also examples of meta-researchers combining a broad range of papers, mixing different interventions (or at least variants of the same intervention), age groups or comparison groups into one meta-analytic synthesis (Sotiriadis et al., 2020). This is a major problem not only for the meta-research community but for any decision maker or practitioner who tries to find information about a relevant intervention to use in practice. Unfortunately, compared with previous studies on the same topic over the last 10 years (Ahn et al., 2012; John et al., 2012), we do not see as much progress in reporting adherence in this set of studies as one would like.
This also relates to additional observations made during the review process. It was clear that most review authors aspired to follow PRISMA, at least for the areas of eligibility criteria, identification and selection of studies, and data collection and study appraisal. However, the systematic reviews judged as unclear or high risk of bias fell short when dealing with, for example, the synthesis, between-study variation, robustness/sensitivity analyses or risk of bias in the included studies.
Our use of ROBIS as an assessment tool shows that the majority of the included reviews show signs of risk of bias. However, one can argue that ROBIS places too high demands on a systematic review: we know, for example, that pre-registration is not common in older reviews, and a missing protocol can result in a judgement of high or unclear risk of bias. Such reviews might not always turn out to be biased, but the missing pieces make critical scrutiny of the evidence much harder.

APPLICABILITY OF ROBIS
Limitations of our conclusions mainly concern the kinds of biases the tool allowed us to identify and the bias items that are not part of the ROBIS tool. While we are confident that the tool quite reliably helps meta-reviewers identify bias in the review process, funding bias, for example, is not part of ROBIS.
Given the limited application of the ROBIS tool for risk of bias assessment in meta-reviews in education, we deemed it useful to report on its utility and possible disadvantages. Similarly to AMSTAR-2, the ROBIS tool seems to encompass a comprehensive list of relevant items for risk of bias assessment. Moreover, its guidance documentation is discernibly more comprehensible than that of AMSTAR-2 (Banzi et al., 2018). The higher clarity of the ROBIS tool might, however, be seen as both an advantage and a limitation. Specifically, thorough explanations and examples might lead researchers to unintentionally limit their focus to the issues reported in the manual and miss other potential problems. This might particularly be an issue if the rater lacks expertise. Conversely, higher standardisation of the possible answers might increase inter-rater reliability and facilitate conflict resolution.
Furthermore, ROBIS provides flexibility, which allows easier use for systematic reviews in the social sciences. However, a limitation of the tool is its length compared with AMSTAR-2, although this can be mitigated by establishing predefined stopping criteria on highly relevant items judged as high risk of bias, in line with the instructions in the guide (Whiting et al., 2023).

FIGURE 1 Flow diagram of the search and inclusion of systematic reviews.

FIGURE [caption not recovered; legend: Domain 2: Identification and selection of studies; Domain 3: Data collection and study appraisal; Domain 4: Synthesis and findings]

TABLE 3 Stage 1: Short ROBIS assessment. Signalling questions: 1.1 Did the review adhere to predefined objectives and eligibility criteria?; 1.3 Were eligibility criteria unambiguous?; 2.1 Did the review search an appropriate range of databases/electronic sources for published and unpublished reports?; 2.2 Were methods additional to database searching used to identify relevant reports?; 3.1 Were efforts made to minimise error in data collection?; 3.2 Were sufficient study characteristics available for both review authors and readers to be able to interpret the results?; 3.4 Was risk of bias (or methodological quality) formally assessed using an appropriate tool?; 3.5 Were efforts made to minimise error in risk of bias assessment? Abbreviations: NI, no information; PY, probably yes; PN, probably no.

FIGURE Risk of bias domains.

TABLE 4 Studies assessed for risk of bias. [Columns: Protocol; Reproducibility check (searches, primary data, reproducible data); Domain 1: Eligibility criteria; Domain 2: Identification and selection of studies; Domain 3: Data collection and study appraisal; Domain 4: Synthesis and findings; Risk of bias in the review.] Signalling questions in ROBIS: 1.2 Were the eligibility criteria appropriate for the review question?; 1.4 Were any restrictions in eligibility criteria based on study characteristics appropriate (e.g., date, sample size, study quality, outcomes measured)?; 1.5 Were any restrictions in eligibility criteria based on sources of information appropriate (e.g., publication status or format, language, availability of data)?; 2.3 Were the terms and structure of the search strategy likely to retrieve as many eligible studies as possible?; 2.4 Were restrictions based on date, publication format, or language appropriate?; 2. […]