Risk of bias in systematic reviews of tendinopathy management: Are we comparing apples with oranges?

We aimed to provide an overview of the use of risk of bias (RoB) assessment tools in systematic reviews (SRs) in tendinopathy management given increased scrutiny of the SR literature in clinical decision making. A search was conducted in Medline from inception to June 2020 for all SRs of randomized controlled trials (RCTs) assessing the effectiveness of any intervention(s) on any location(s) of tendinopathy. Included SRs had to use one of (a) Cochrane Collaboration tool, (b) PEDro scale, or (c) revised Cochrane Collaboration tool (RoB 2) for their RoB assessment. A total of 46 SRs were included. Around half of SRs (46%) did not use an RoB assessment in data synthesis, and only 30% used it to grade the certainty of evidence. The RoB 2 tool was the most likely to determine “overall high RoB” (52%) followed by the Cochrane Collaboration tool (34.6%) and the PEDro scale (18.6%) as determined by the authors of the SRs. We have demonstrated substantial problems associated with the use of RoB assessments in tendinopathy SRs. The universal use of a single RoB assessment tool should be promoted by journals and SR guidance documents.


| INTRODUCTION
The constant emergence of new treatment modalities for tendinopathy over the last few decades and the absence of robust evidence for their effectiveness have led to an increasing number of randomized controlled trials (RCTs). Systematic reviews (SRs) of RCTs constitute the strongest level of evidence and can therefore inform clinical practice, both at a policy level and an individual physician level. A SR should be transparent and reproducible, and subjectivity should be kept to a minimum. 1 Unfortunately, firm guidance on conducting a SR does not exist and several parameters are left to the judgment of the authors. Moreover, recent debate in the Lancet argues that the findings of SRs may be flawed as they often include poor-quality studies that should not have been published in the first place. 2 One of the parameters left to the authors' judgment is risk of bias (RoB) assessment; not only is it a subjective process by its nature, but the existence of several RoB assessment tools further decreases reproducibility by introducing inconsistency. RoB assessment plays an integral role in SRs, and it is an essential part of data synthesis and the reporting of the results. It can be used in one of two ways in a SR, either for subgroup analyses (ie, including only RCTs with low risk of bias) or in determining the strength of evidence for each result in conjunction with other limitations of the included evidence that arise as a result of combining the findings of different studies (consistency, imprecision, etc). 3

| Eligibility
SRs were eligible if they assessed the effectiveness of any intervention(s) on any location(s) of tendinopathy in patients over 16 years of age, included only RCTs, and used one of the following RoB assessment tools: Cochrane Collaboration tool, PEDro scale, RoB 2 tool (revised Cochrane Collaboration tool). SRs were excluded if they included a mixture of randomized and non-randomized studies or a mixture of participants with tendinopathy and other conditions. SRs in languages other than English were also excluded. No restrictions were applied regarding publication date, journal type, type of tendinopathy and intervention, outcome measures, or length of follow-up.

| Search strategy-Screening
A literature search was conducted by the first author via Medline in June 2020 with the following search string in "All Fields": "((systematic review) OR (meta-analysis) AND (tendin*) AND (randomi*))".
For all eligible articles, the reference lists and PubMed's "similar articles" list were screened to identify potentially eligible articles that may have been missed at the initial search. Figure 1 (PRISMA flowchart) illustrates the article screening process.
The initial search returned a total of 208 articles. After exclusion of non-eligible articles according to our pre-defined criteria and inclusion of articles identified from reference screening, 46 SRs were included in our review.

| Assessment of consistency of risk of bias assessment
To assess the disparity among tools in determining overall RoB, we used two separate methods. Firstly, for each of the 3 tools separately, we calculated the proportion of RCTs in each included SR that were rated as "high overall RoB," and then the mean proportion for each tool. Where overall RoB was determined by the authors of the original SR for each RCT, this was used. We also used our own pre-defined criteria (see below) to determine overall RoB for each RCT based on the RoB assessment results reported by the SR authors. Inter-tool reliability was not evaluated formally with statistical tests for this method as the RCTs assessed by each tool were not the same; instead, our purpose was to give a general impression of the likelihood of each tool determining "high overall RoB" for RCTs and to investigate inter-rater inconsistencies when different criteria are used for the same studies.
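The first method can be illustrated with a minimal sketch. It assumes that the "mean proportion" is the unweighted mean of the per-SR proportions; the function name and the example counts are hypothetical and are not data from the included SRs.

```python
from collections import defaultdict
from statistics import mean


def mean_high_rob_proportion(reviews):
    """Return {tool: mean of per-review proportions of high-RoB RCTs}.

    `reviews` is an iterable of (tool, n_high_rob, n_rcts) tuples, one per SR.
    Assumption: each SR contributes equally (unweighted mean).
    """
    per_tool = defaultdict(list)
    for tool, n_high, n_total in reviews:
        if n_total == 0:
            continue  # an SR with no assessed RCTs has no defined proportion
        per_tool[tool].append(n_high / n_total)
    return {tool: mean(props) for tool, props in per_tool.items()}


# Hypothetical counts for illustration only.
example = [("PEDro", 2, 10), ("PEDro", 1, 10), ("Cochrane", 3, 6)]
result = mean_high_rob_proportion(example)
```

A mean weighted by the number of RCTs per SR would be an equally defensible choice and could give different figures.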
Secondly, in light of the newly published RoB 2 tool by the Cochrane Collaboration and its use by the most recently published SR of RCTs in Achilles tendinopathy by van der Vlist et al, 8 we assessed RoB of its 29 included RCTs using the two other RoB assessment tools, the Cochrane Collaboration tool and the PEDro scale. We then compared the reliability among the three tools (Cochrane Collaboration and PEDro as performed by the authors of the present review and RoB 2 by the authors of the original SR) at determining overall RoB. We only tested inter-tool reliability for overall RoB determination and not specific domains of the tools as only the former is directly associated with implementation of RoB assessment in data synthesis.
Inter-tool reliability was only assessed for determining "high overall RoB," which is the aspect of RoB assessment with direct application in data syntheses. "High overall RoB" RCTs determine downgrading of the quality of the evidence, and they are the studies removed for subgroup/ sensitivity analyses. For the purposes of the statistical tests, the 29 assessed RCTs were divided in two categories, "high overall RoB" and "other" ("low overall RoB"/"unclear RoB"/"some concerns"), and each category represented each one of the two possible outcomes in the Cohen's kappa formulas.
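As a hedged illustration of this dichotomisation, the sketch below computes Cohen's kappa for two tools applied to the same RCTs. It uses the standard chance-agreement term (product of the two tools' "high" proportions plus product of their "other" proportions), which is an assumption about the intended calculation; the function names and example ratings are hypothetical.

```python
def dichotomise(rating):
    """Collapse a rating into 'high' vs 'other' (low / unclear / some concerns)."""
    return "high" if rating == "high" else "other"


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two tools rating the same RCTs, after dichotomisation."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both tools must rate the same, non-empty set of RCTs")
    a = [dichotomise(r) for r in ratings_a]
    b = [dichotomise(r) for r in ratings_b]
    n = len(a)
    # Observed agreement: share of RCTs given the same category by both tools.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each tool's marginal proportions.
    pa_high = a.count("high") / n
    pb_high = b.count("high") / n
    p_e = pa_high * pb_high + (1 - pa_high) * (1 - pb_high)
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of six RCTs by two tools.
tool_a = ["high", "high", "some concerns", "low", "high", "unclear"]
tool_b = ["high", "low", "low", "low", "high", "high"]
kappa = cohens_kappa(tool_a, tool_b)
```

In this made-up example the tools agree on 4 of 6 RCTs and each rates half as "high", so chance agreement is 0.5 and kappa is about 0.33.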

Overall RoB determination (our criteria)
The RoB 2 tool provides clear, specific instructions on how the overall RoB for each study should be determined 5 ; therefore, we only used the SR authors' assessment.
With regard to the PEDro scale, its final score is traditionally interpreted as 8-10 "excellent quality" and 6-7 "good quality"; therefore, we first used ≥6 as a cutoff to divide high and low overall RoB (or low and high study quality, respectively), as this is the criterion most commonly used by SR authors (PEDro ≥ 6). We also used ≥8 as a cutoff to see which score gives more similar results to the other tools (PEDro ≥ 8). As the majority of authors use the PEDro scale for "study quality" and not RoB assessment, for the purposes of this review "high overall RoB" was synonymous with "moderate" or "poor" study quality.
For the Cochrane Collaboration tool, RCTs were considered as "high overall RoB" if they had: (a) high RoB in any of "random sequence generation," "allocation concealment," "blinding of patients and staff," or "blinding of outcome measures" or (b) high RoB in 2 or more of the remaining 3 items ("completeness of outcome data," "selective reporting," and "other") or (c) high RoB in one of the 3 remaining domains if the authors felt the RoB introduced through that domain was significant enough to affect the results of the study. "Unclear overall RoB" was assigned to studies with 3 or more unclear RoB in individual domains not fulfilling the criteria for "high overall RoB," and "low overall RoB" to those not fulfilling the criteria for high and unclear overall RoB.

FIGURE 1 PRISMA flowchart (recoverable labels): full-text articles excluded, with reasons (n = 26): included nonrandomised studies (n = 14), did not use one of the pre-specified risk of bias assessment tools (n = 12); additional articles identified through reference screening (n = 2); studies included in data synthesis (n = 46).
These criteria, especially for the Cochrane tool and to a lesser extent for the PEDro scale, have been specified by the authors of the present review based on advice deriving from the creators of the Cochrane tool and other researchers 9-11 ; they do not represent the "appropriate" criteria as the creators themselves did not specify any; however, we use them to emphasize the extent of inconsistency and subjectivity.
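The criteria above can be expressed as a short decision rule. The sketch below is an illustration only: the domain keys, function names, and the `reviewer_judges_significant` flag (standing in for the authors' judgment in criterion (c)) are hypothetical labels, and domain ratings are assumed to be recorded as "low", "unclear", or "high".

```python
KEY_DOMAINS = (
    "random_sequence_generation",
    "allocation_concealment",
    "blinding_patients_staff",
    "blinding_outcome_measures",
)
REMAINING_DOMAINS = ("completeness_outcome_data", "selective_reporting", "other")


def cochrane_overall_rob(domains, reviewer_judges_significant=False):
    """Overall RoB for one RCT from its seven Cochrane domain ratings."""
    key_high = any(domains[d] == "high" for d in KEY_DOMAINS)
    remaining_high = sum(domains[d] == "high" for d in REMAINING_DOMAINS)
    if (
        key_high                                                 # criterion (a)
        or remaining_high >= 2                                   # criterion (b)
        or (remaining_high == 1 and reviewer_judges_significant)  # criterion (c)
    ):
        return "high"
    all_domains = KEY_DOMAINS + REMAINING_DOMAINS
    if sum(domains[d] == "unclear" for d in all_domains) >= 3:
        return "unclear"
    return "low"


def pedro_overall_rob(score, cutoff=6):
    """Scores at or above the cutoff count as low RoB; cutoff is 6 or 8 here."""
    return "low" if score >= cutoff else "high"


# Hypothetical RCT: one "high" among the remaining domains, three "unclear".
rct = dict.fromkeys(KEY_DOMAINS + REMAINING_DOMAINS, "low")
rct.update(selective_reporting="high", allocation_concealment="unclear",
           blinding_patients_staff="unclear", other="unclear")
```

With these ratings, the same RCT is "unclear" by default and "high" once the reviewer judges the selective reporting to be significant, which shows how much criterion (c) depends on the assessor.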
The following formula was used for the calculation of Cohen's statistic between each combination of two tools:

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

where $P_o$: the sum of the mutual RCTs rated as "high overall RoB" and "other" in the two tools; $P_e$: (proportion of "high overall RoB" RCTs multiplied by proportion of "other" RCTs in tool 1) + (proportion of "high overall RoB" RCTs multiplied by proportion of "other" RCTs in tool 2).

Table 1 summarizes the key characteristics of the eligible SRs. 8, Of the 46 included SRs, 31 used the Cochrane Collaboration tool, 13 the PEDro scale, 2 the revised Cochrane Collaboration tool (RoB 2), and 2 both the Cochrane Collaboration tool and the PEDro scale. Modified versions of the PEDro scale and the Cochrane Collaboration tool were used by two and one SRs, respectively. RoB was assessed on an outcome and not study level in only 3 SRs (6.5%). An overall RoB for each assessed RCT/outcome was determined in 17 SRs (37%; n = 7 PEDro scale, n = 2 RoB 2 tool, n = 8 Cochrane Collaboration tool). A total of 21 SRs (46%) did not use the results of their RoB assessment anywhere in data synthesis; the remaining 25 that did used it for either subgroup/sensitivity analyses excluding "high overall RoB"/"low-quality" studies (n = 9; 36%), for grading the quality of the evidence (n = 14; 56%), or both (n = 1; 4%). Where the quality of the evidence was graded, tools used included the GRADE tool 3 (n = 6; 43%), the Cochrane BRG tool 9 (n = 5; 36%), and the NHMRC tool 1 (n = 1; 7%), while the authors of 3 SRs (21%) graded the evidence arbitrarily without a pre-specified method.

| Overall RoB determination
Where authors of SRs determined overall RoB of assessed RCTs, the following methods were used for each tool:
• RoB 2: according to the instructions of the tool (n = 2)
• Cochrane Collaboration tool: (a) "overall high RoB" where <3 domains had low RoB (n = 2) or where >3 domains had high RoB (n = 1); (b) "overall low RoB" where the total score of the study was >70% (out of 16; low RoB scored 2, unclear RoB 1, and high RoB 0, n = 1); (c) "good quality study" where no more than 1 of the domains of the tool, precision, and external validity was high RoB (n = 2); (d) method not described (n = 2)
• PEDro: (a) "overall good quality/low RoB" where total score ≥6/10 (n = 4), ≥7/10 (n = 1) or ≥7/13 for modified PEDro (n = 1); (b) "overall low quality/high RoB" where total score < 5/10 (n = 2)
Table 2 shows the proportion of "overall high RoB" RCTs as determined by (a) the authors of the original SRs where performed, using their own "high overall RoB" criteria and (b) the first author of the present review (DC) based on the RoB assessment performed by the SR authors using our predefined "high overall RoB" criteria for each tool. Mean percentages were calculated for each tool.

| Consistency among tools
Based on the overall RoB assessments reported by the authors of the original SRs, the RoB 2 tool was the most likely to determine a "high overall RoB" (mean proportion of high RoB RCTs 52%), followed by the Cochrane Collaboration tool (mean proportion 34.6%). The PEDro scale was associated with the lowest mean proportion of "high overall RoB" RCTs (18.6%). When the pre-defined criteria of the authors of the present review were applied, the PEDro ≥ 8 was associated with the highest proportion of high RoB studies (65.4%), followed by the Cochrane Collaboration tool (55%), and finally the PEDro ≥ 6 (29.2%).

| Consistency when different criteria used (SR authors vs authors of present review)
Where we determined "high overall RoB" using our criteria based on the RoB assessment results of the SR authors, the mean proportion of "high overall RoB" studies was substantially higher compared to that of the SR authors for the Cochrane Collaboration tool (55% vs 34.6%) and for the PEDro ≥ 8 (65.4% vs 18.6%). For the PEDro ≥ 6, the difference was smaller (29.2% vs 18.2%) as the majority of SR authors using the PEDro chose a ≥6 cutoff too. The highest variability for individual SRs between the proportion of studies with "high overall RoB" of the SR authors and ours was observed in the Cochrane tool (eg, 3% vs 73% for Dong et al 29

| Inter-tool reliability in example systematic review
Tables 3a and 3b show the RoB assessment that we performed for the 29 RCTs of the van der Vlist 8 SR using the Cochrane Collaboration tool (Table 3a) and PEDro scale (≥6 and ≥ 8) (Table 3b) with our criteria. Table 3c shows the RoB assessment as performed by van der Vlist et al 8 using the RoB 2 tool and the results of the overall RoB assessment from the other two tools as derived from Tables 3a and 3b, highlighting the generally poor inter-tool reliability. The only comparison that produced substantial reliability (k = 0.76) was that between the Cochrane tool and the PEDro ≥ 8. Fair reliability was found for the comparisons between the Cochrane tool and the PEDro ≥ 6 (k = 0.36), the Cochrane and the RoB 2 (k = 0.29), and the RoB 2 and PEDro ≥ 8 (k = 0.26). Finally, inter-tool reliability between the RoB 2 and the PEDro ≥ 6 was only slight (k = 0.03).
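The labels "slight," "fair," and "substantial" match the commonly used Landis and Koch bands for kappa. The sketch below assumes that scheme, which the text does not name explicitly; the function name is hypothetical.

```python
def kappa_label(kappa):
    """Map a kappa value to a Landis and Koch agreement label (assumed scheme)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Kappa values reported above for each pair of tools.
reported = {
    "Cochrane vs PEDro >= 8": 0.76,
    "Cochrane vs PEDro >= 6": 0.36,
    "Cochrane vs RoB 2": 0.29,
    "RoB 2 vs PEDro >= 8": 0.26,
    "RoB 2 vs PEDro >= 6": 0.03,
}
labels = {pair: kappa_label(k) for pair, k in reported.items()}
```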

| DISCUSSION
We have demonstrated several problems relating to the use of RoB assessment in SRs of tendinopathy management that need the attention of the research community. In our scoping review, we found that almost half of the included SRs did not use their RoB assessment in data synthesis. Additionally, only 6.5% of SRs assessed RoB on an outcome level rather than a study level, while only 30% of all SRs used their RoB assessment for evidence grading, which is the primary purpose of performing a RoB assessment. In light of the substantial subjectivity and lack of transparency and reproducibility that govern the conduct of SRs, we strongly recommend that future SR authors determine overall RoB for each study (on an outcome level) with the use of clear and reproducible pre-defined criteria.
Whether overall RoB should be determined for each RCT is a controversial question, and this controversy is apparent in the tools themselves. The creators of the original Cochrane Collaboration tool 4 advised against rating overall RoB for each study and in favor of determining overall RoB on a domain level instead; however, this was neither explained further with clear, reproducible instructions nor applicable in practice for evidence grading. The revised Cochrane Collaboration tool (RoB 2) 5 published last year includes instructions on determining overall RoB for each study; however, the creators highlight that this needs to be done on an outcome level. Finally, the PEDro scale, 6,7 which its creators define as "a scale to measure the quality of reports of RCTs," does not define specific criteria or score cutoffs and is often incorrectly labeled as a "quality assessment" and not "RoB" tool. In addition to internal validity (RoB), measures of study quality include external validity (generalizability) and precision (freedom from random error), which the 10-item scale does not include. This is also acknowledged by the creators themselves. 7
The comparison of the likelihood of each of the three tools rating an RCT as "high overall RoB" demonstrated clearly that the PEDro was overly generous as used by the SR authors, rating the majority of assessed RCTs (81.7%) as "low overall RoB"/"good overall quality." It is implausible that such a substantial proportion of tendinopathy RCTs are actually of "low overall RoB": many of them are not double-blinded (due to their nature), and the other two RoB assessment tools demonstrated greater proportions of "high overall RoB" RCTs. Finally, inter-tool reliability among the three tools was generally poor except for the comparison of the Cochrane Collaboration tool and the PEDro ≥ 8, which reinforces the need for PEDro to be used with stricter criteria.
When we assessed our own pre-defined criteria against those used by the SR authors, it was apparent that especially for the Cochrane Collaboration tool there were substantial discrepancies. One might argue that our strict criteria resulted in a very low threshold of rating an RCT as "high overall RoB"; however, the recently published RoB 2 is very close to our criteria in that respect as all it takes for a "high overall RoB" is high RoB in a single domain. These marked disparities reflect the significant effects that subjectivity, inconsistency, and lack of reproducibility can have on the results of the same SRs with regard to grading the quality of evidence. If we demonstrated inconsistencies this significant only by using different criteria for RoB assessment results as reported by the SR authors, one can imagine how much more substantial these disparities can be when the same RCTs are assessed by different people, with different tools, using different criteria for each tool. Finally, a naturally arising question is therefore "how much bias is enough to distort the true result of an RCT?"; unfortunately, this and other similarly subjective judgments are needed for the conduct and reporting of all SRs. The ideal RoB assessment tool does not exist. Subjectivity can never be removed completely from RoB assessment; however, this needs to be kept to a minimum and be complemented by transparency and reproducibility. These are exactly the aims of the revised Cochrane Collaboration tool, the creators of which state that they expect the new tool to be more likely to rate studies as "low overall RoB." 5 This was clearly not the case with the example SR used in the present review by van der Vlist et al 8 who rated none of the 29 RCTs as "low risk." Reasons for that might be either the actual presence of bias in all the included RCTs, strict thresholds used by the SR authors or poor performance of the tool itself. 
The same tool applied in the other SR 46 included in this review identified a much higher proportion of "low overall RoB" RCTs (4/7). Despite attempts of the creators to make the tool more user friendly and reproducible, 4 there is still significant subjectivity in some of its signaling questions (eg, "could assessment have been influenced by knowledge of intervention?" or "likely that missingness depended on true value"). However, importantly the

TABLE 3A Our risk of bias assessment of the 29 RCTs included in the systematic review by van der Vlist (2020) 8