Bias in observational studies on the effectiveness of in-hospital use of hydroxychloroquine in COVID-19

During the first waves of the coronavirus pandemic, evidence on potentially effective treatments was urgently needed. Results from observational studies on the effectiveness of hydroxychloroquine (HCQ) were conflicting, potentially due to biases. We aimed to assess the quality of observational studies on HCQ and its relation to effect sizes.

• More than half of the included observational studies (18/33, 55%) were at critical risk of bias, eleven (33%) were at serious risk and only four (12%) were at moderate risk of bias.
• The added value and quality of observational evidence regarding HCQ for COVID-19 patients should therefore be carefully considered.

Plain Language Summary
During the first waves of the coronavirus pandemic, evidence was urgently needed about potentially effective treatments, such as hydroxychloroquine (HCQ). However, results from observational studies on the effectiveness of HCQ were conflicting, probably due to biases. In this study, we used the ROBINS-I tool to assess the quality of observational studies that evaluated the effectiveness of HCQ treatment for COVID-19 patients. This tool assesses the risk of bias in seven domains: confounding, selection of participants, misclassification of interventions, deviations from intended interventions, missing data, misclassification of outcomes, and completeness of reporting.
For each of the domains, the risk of bias was judged independently by two reviewers to be "Low", "Moderate", "Serious", or "Critical". The overall score per study was determined by the highest score across all the domains. We found that more than half of the included observational studies (18/33, 55%) were at critical risk of bias, eleven (33%) were at serious risk and only four (12%) were at moderate risk of bias. In addition, we studied the association between the quality of the studies and the estimates of effect sizes. To do so, the effect sizes found in the observational studies were compared with those from randomized clinical trials (RCTs). We did not find an association between the effect sizes and the study design. Because of the heterogeneity in the quality of observational HCQ studies, we recommend that synthesis of evidence of effectiveness of HCQ in COVID-19 patients should focus on RCTs and that the added value and quality of observational evidence should be carefully considered.

| INTRODUCTION
During the first waves of the pandemic with severe acute respiratory syndrome coronavirus (SARS-CoV-2) and its associated 2019 coronavirus disease (COVID-19), effective treatments were urgently needed to reduce mortality, the severity of symptoms, and hospitalization rates. Doctors were forced to make choices regarding which treatments were likely to save the lives of critically ill patients, without sufficient evidence-based knowledge about effective treatments for this new disease. 1 Hydroxychloroquine (HCQ) was one of the drugs that caught the attention of researchers, clinicians, and the public early in the pandemic. Preclinical studies indicated HCQ as a potentially effective treatment for the symptoms of COVID-19 because of its in vitro antiviral effects. 2 Researchers conducted many observational studies while waiting for results from randomized controlled trials (RCTs). As HCQ was already used early during the pandemic in the treatment of COVID-19, there was an opportunity to perform observational studies using routinely collected patient data. 3 Indeed, well-designed observational studies can be helpful in the generation of hypotheses about the potential effects of drugs; however, observational studies can also provide biased results when not properly designed and analyzed. 4 The observational studies on the effectiveness of HCQ reported divergent results, from a five-fold reduction in mortality risk to an eight-fold increased risk of intensive care unit (ICU) admission, 5,6 leading to a heated debate about the effectiveness of HCQ. 6-10 Today, RCTs have convincingly shown that HCQ has no benefit in the treatment of COVID-19 patients and may even be harmful. 11,12 Variation in the estimated effects of HCQ treatment and divergence from the RCT results may arise from differences between the included patient populations.
Subgroup analyses of a meta-analysis on the effectiveness of HCQ in COVID-19 patients showed, however, that disease severity or age did not lead to different outcomes; in all these subgroups, no effect of HCQ on COVID-19 outcomes was found. 13 The differences in estimated effect have therefore led to discussions on how the lack of a proper study design, short review times, and reviewers' possible lack of expertise might have led to unjustified conclusions. 14,15 For example, many studies suffered from immortal time bias, confounding bias, or bias due to inadequately accounting for competing risks. 16,17 The aim of the current study was to provide a comprehensive overview and assess the overall quality of observational studies on HCQ conducted during the first wave of the COVID-19 pandemic, and to relate this quality to the observed effect sizes reported in completed trials.

[...] and 01/03/2021, using the search string "hydroxychloroquine AND covid-19." The included studies were peer-reviewed primary research articles, were published in English, and used an observational design to investigate in-hospital treatment with HCQ. Moreover, studies were included if they measured one of the following clinical outcomes: mortality, duration of hospitalization, need for mechanical ventilation, or time to clinical improvement. We included only studies in which HCQ (with or without azithromycin) as treatment for COVID-19 was compared to standard care. All studies focusing on the use of HCQ as prophylaxis for COVID-19 were excluded.
For each included observational study, we extracted the journal name, the impact factor of the scientific journal, and the journal ranking according to the InCites Journal Citation Reports (Q1, Q2, Q3, or Q4). 18 We also extracted the date of first submission, acceptance, and first publication; the geographical region of the study population (Africa, Asia, North America, and Europe); the study design (cohort, case-control, or other); the number of included subjects; and the proportion of study subjects treated with HCQ.

| Quality of studies
We used the ROBINS-I tool to assess the risk of bias for each of the included studies. This tool is often used for the quality assessment of observational studies. 19 It defines seven domains in which bias could occur: bias due to confounding, selection of participants, classification of interventions, deviations from intended interventions, missing data, measurement of outcomes, and selective reporting. For each of the domains, the risk of bias could be "Low", "Moderate", "Serious", or "Critical", based on the decision support table provided with the tool. 19 In order to assess the risk of bias due to confounding, we prespecified a minimal set of potential confounders that play a role in the association between in-hospital treatment with HCQ and clinical outcomes in COVID-19 patients (see supplementary materials, section A).
Since it is very likely that treatment decisions were related to factors that also influenced the outcome risk, the lowest score for this domain was "Moderate." 19 The quality assessment was performed in duplicate (MH and SB), and discrepancies were discussed. The overall score per study was determined by the highest score across the domains.
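The "highest score across the domains" rule can be illustrated with a minimal Python sketch. The function name, the domain keys, and the example study are hypothetical, and the handling of "No information" (skipped rather than ranked) is an assumption, since the text does not say how such entries enter the overall score.

```python
# Ordered ROBINS-I judgements, from least to most severe.
SEVERITY = ["Low", "Moderate", "Serious", "Critical"]


def overall_risk(domain_scores):
    """Return the most severe judgement found across the domains.

    Assumption (not stated in the text): "No information" entries are
    skipped instead of being treated as a severity level.
    """
    rated = [score for score in domain_scores.values() if score in SEVERITY]
    if not rated:
        return "No information"
    return max(rated, key=SEVERITY.index)


# Hypothetical study, for illustration only (not taken from the review).
example_study = {
    "confounding": "Serious",
    "selection of participants": "Critical",
    "classification of interventions": "Low",
    "deviations from intended interventions": "Low",
    "missing data": "No information",
    "measurement of outcomes": "Low",
    "selective reporting": "Moderate",
}

print(overall_risk(example_study))
```

Under this rule a single severe domain determines the overall judgement, which is why one source of bias such as immortal time can place an otherwise careful study in the highest risk category.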

| Effect size
The effect estimates reported in the observational studies were compared to the effect estimates found in a meta-analysis of RCTs; the latter served as a reference. Since effect estimates might also differ between RCTs, we selected benchmark estimates from the meta-analysis by Siemieniuk et al.
This meta-analysis, published in the BMJ, is a living review, with the last update (at the time of study) on the 6th of April 2021. 13 The outcomes included in this meta-analysis were mortality, need for mechanical ventilation, duration of hospital stay, ventilator-free days, and time to clinical improvement. Clinically relevant effects of HCQ were not observed for any of the outcomes (Table 1). For each of the outcomes, we extracted the point estimates and corresponding confidence intervals.
We note that the effect estimates obtained from RCTs are also estimates and hence do not necessarily reflect the true effect of HCQ.
In addition, this approach cannot distinguish whether deviations between effect estimates from observational studies and RCTs arise from bias, effect modification by patient characteristics, or (random) sampling variability. Moreover, different relative measures were estimated, such as odds ratios (ORs), hazard ratios (HRs), and relative risks (RRs). Although these measures are calculated differently, for all of them an estimate close to 1 indicates no effect of HCQ, whereas estimates further from 1 indicate a positive or negative effect of HCQ. Therefore, we hypothesize that in observational studies where effect estimates deviate more from those found in this meta-analysis of RCTs, the potential for bias is larger.
For each observational study, we extracted the point estimates for the primary outcome. If a study included multiple primary outcomes, we included the effect estimate for mortality, if present. For all relative measures, we subsequently calculated the extent to which this effect deviated from the benchmark estimates. This deviation was calculated as abs(log(HR_obs) - log(HR_RCT)).
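As a rough illustration, the deviation measure can be written as a small Python function. The function name is hypothetical, and the base of the logarithm is an assumption: the text does not state it, but the log-scale differences quoted in the discussion (0.13 for estimates of 1.08 and 1.46, and 0.08 for 0.68 and 0.82) are consistent with base 10.

```python
import math


def log_deviation(estimate_obs, estimate_rct):
    """Absolute difference between two relative effect estimates on the log scale.

    Assumption: base-10 logarithm, inferred from the worked figures in the
    discussion rather than stated in the methods.
    """
    if estimate_obs <= 0 or estimate_rct <= 0:
        raise ValueError("relative effect estimates must be positive")
    return abs(math.log10(estimate_obs) - math.log10(estimate_rct))


# Pairs of estimates quoted in the discussion section.
print(round(log_deviation(1.46, 1.08), 2))
print(round(log_deviation(0.82, 0.68), 2))
```

Working on the log scale makes the measure symmetric: an estimate of 2.0 and an estimate of 0.5 are equally far from a benchmark of 1.0.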

| Data analysis
TABLE 1 Estimates of the effect of hydroxychloroquine (HCQ) in COVID-19 patients from randomized controlled trials (RCTs). Columns: Outcome; Estimate from meta-analysis/RCTs.

Publication and study characteristics were described using descriptive statistics. The publication date was dichotomized as before or in June 2020 versus after June 2020; June 2020 was the month in which the interim results of the RECOVERY trial were published and the FDA decided to revoke the emergency use authorization for HCQ. 20,21 For each of the included studies, we described the score for each of the domains and the overall score. In addition, we summarized the number of studies within each of the domains that were considered to be at "Low", "Moderate", "Serious" or "Critical" risk of bias, or that were scored as having "No information." The relation between publication details (journal ranking [defined as Q1, Q2, Q3, or Q4], publication date, and time between submission and publication, as proxy for the peer-review process) and the overall quality of the studies as well as the relation between the effect size and the overall quality were assessed using Spearman's correlation.
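The Spearman analysis described above can be sketched in plain Python as a Pearson correlation computed on average ranks. This is only an illustrative sketch: the numeric coding of the risk categories and the toy values are hypothetical and are not data from the review, and no p-value is computed.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ordinal coding of the overall risk of bias.
RISK_CODE = {"Moderate": 1, "Serious": 2, "Critical": 3}

# Toy values for illustration only (not data from the review).
toy_quality = ["Moderate", "Serious", "Serious", "Critical", "Critical"]
toy_deviation = [0.02, 0.10, 0.05, 0.30, 0.08]

rho = spearman_rho([RISK_CODE[q] for q in toy_quality], toy_deviation)
print(round(rho, 2))
```

Because the overall quality takes only three values, ties are unavoidable, which is why the ranks are averaged within tie groups instead of using the shortcut formula based on squared rank differences.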

| RESULTS
Our search strategy yielded 2331 hits in PubMed, 79 of which were selected on the basis of title and abstract. Of those studies, 33 were included in this review. The reasons for inclusion and exclusion are depicted in Figure 1. A list of all included studies is presented in the supplementary materials (Section B).

| Publication and study characteristics
The characteristics of the included studies are summarized in Table 2.

| Quality of studies
The quality of the studies is depicted in Figure 2 and summarized per domain in Figure 3. A substantiation for the assigned scores is given in the Supplementary materials (Section C). Overall, 18 (55%) studies were considered to be at critical risk of bias, 11 (33%) were considered to be at serious risk of bias, and 4 (12%) to be at moderate risk of bias.
The domains that were most often scored as critical risk of bias were bias due to selection of participants (n = 13, 39%) and bias due to confounding (n = 8, 24%). Bias due to selection of participants was most often caused by the introduction of immortal time bias, which was present in 22 studies (67%), although to varying degrees. Bias due to confounding was to some extent present in all studies due to the observational nature of the studies, but some studies had a greater risk of bias due to confounding than others. For example, studies were considered to be at critical risk of bias when there was no adjustment for confounders, if there were major differences in disease severity between the treatment groups for which no adjustment was made, or if the study adjusted for intermediates, such as the use of mechanical ventilation during hospital stay.
The domain that was most often scored as low risk of bias was bias due to deviation from intended intervention (n = 31, 94%), because any deviations observed were not beyond what would be expected in usual practice. Bias due to measurement of outcomes was also often scored as low (n = 23, 70%), because clinical outcomes like mortality are unlikely to suffer from measurement errors, although we did observe competing risk bias in two studies (6%).
The domain "bias due to missing data" was most often scored as "No information"; 11 studies (33%) did not report how missing data were handled.

| Outcome measure and effect size
Twenty-one of 33 studies (64%) measured the effect of HCQ on mortality. Other outcome measures were survival (n = 2, 6%); ICU admission (n = 2, 6%); hospital length of stay (n = 2, 6%); or a composite outcome of mortality, ventilation, and/or ICU admission (n = 6, 18%). [...] which is equal to the odds ratio of the benchmark estimate. 22

| Relation between publication details and overall quality
The results of the Spearman's correlation test estimating the association between publication details and the overall quality are summarized in Table 3. Studies scored to be at moderate risk of bias were more often published in journals ranked as "Q1" than studies scored to be at serious or critical risk of bias: 75% of the studies scored as moderate risk were published in a journal ranked as "Q1", whereas for studies scored as serious or critical these percentages were 45.5% and 33.3%, respectively. However, this relation was not significant (Spearman's rho 0.27, p-value 0.13). There were also no significant relations found between the overall quality and the length of the reviewing process or the publication date.

| Association between effect size and overall quality
The association between the effect size and the overall quality is given in Table 3. There was no significant association between the effect size and the study quality (Spearman's rho 0.20, p-value 0.31), although the deviation from the effect estimates found in RCTs was lower in studies scored at moderate risk of bias.

| Summary of findings
In this review of observational studies of HCQ treatment for COVID-19, we observed that more than half of all included observational studies were scored as critical risk of bias (18/33, 55%), eleven (33%) at serious risk and only four (12%) at moderate risk of bias. No significant associations were observed between study quality and the estimated effect or between the study quality and the impact factor of the journal in which the study was published. In addition, we did not find an association between the study quality and the time between submission and publication date. The time between submission and publication date was used as a proxy for the peer-review process, although time to publication may be misclassified when a manuscript has been submitted to multiple journals before acceptance. Lastly, we did not find a relation between publication date (before or in June 2020 versus after June 2020) and study quality.
In line with what others found, 16,17 we also observed that many studies suffered from immortal time bias, confounding bias, or bias due to inadequately accounting for competing risks. Immortal time analyses in some of the included studies also suggested this. Some studies assessed, for example, the impact of immortal time bias in sensitivity analyses and found large differences in the effect estimates, with estimates changing, for instance, from 1.08 to 1.46 (deviation 0.13 on the log scale) 26 or from 0.68 to 0.82 (deviation 0.08 on the log scale). 27 Interestingly, these differences in effect estimates were neither presented nor discussed in the main text, but only presented in the supplementary materials. In contrast, we found that different methods to handle missing data or different approaches for confounder adjustment (e.g., multivariable analysis and propensity score adjustment) had a more limited impact on the effect estimates in this selection of studies; the largest deviation from the primary effect estimate was 0.07 on the log scale. 27-29 Furthermore, in one third of all included studies, insufficient information was reported in the article to fully comprehend all methodological choices. We observed this most often in the assessment of bias due to missing data. This poor reporting is not specific to COVID-19 research and has also been observed for pharmacoepidemiologic studies in general. 30,31 Understandably, due to word limits, authors are unable to elaborate on all methodological decisions in their manuscript. However, in situations where methods deviate from generally accepted methods, substantiation of the choices that were made is needed for correct interpretation of the results of a study as well as for assessment of its validity.

| Strengths and limitations
One of the strengths of our study was the systematic and duplicate assessment of all biases defined in the ROBINS-I tool and our transparent substantiation of the assigned scores in the Supplementary materials. Another strength was the use of strict inclusion criteria to define our study sample. As a result, a relatively homogeneous set of included studies was obtained, which enabled us to compare effect estimates and relate these to the study quality.
One of the limitations of our study was that our search strategy was limited to PubMed (Medline), and we did not search Embase. However, the aim of our review was not to give a comprehensive overview of all studies of HCQ and COVID-19, but rather to give an impression of the quality of such studies. Another limitation was that the quality assessment was performed on the basis of information that was reported in the publication. This is an indirect way of assessing the risk of bias, as the extent to which a correct assessment is possible depends on the quality of reporting. A third limitation was that we used the effect estimates from the meta-analysis of RCTs as benchmark, while reasons other than the presence of bias could explain the deviations between the estimates from the observational studies and the RCTs.
Yet, we hypothesize that in observational studies where effect estimates deviate more from those found in RCTs, the potential for bias is larger.

| Implications and recommendations
During the COVID-19 pandemic, with a high need for evidence-based therapy decisions, the results of observational studies formed the basis for clinical guidelines when RCT results were not yet available.
Although more than half of all included observational studies were scored as critical risk of bias, the promising results of some early observational studies did lead to strong recommendations to treat with HCQ in some countries or healthcare organizations, instead of waiting for the results of RCTs or high-quality observational studies. 6 [...] improve the quality, researchers should work in multidisciplinary teams that include clinicians, methodologists, and database experts, among others, to combine their knowledge and research skills. On the other hand, guidelines for designing observational studies should be used. There are currently a number of guidelines to support the design of a pharmacoepidemiologic study and to avoid potential biases. 32-36 In addition to these general pharmacoepidemiologic guidelines, recommendations specific to pharmacoepidemiologic COVID-19 research were published at the beginning of the pandemic (May 5th, 2020). 4 One can also design an observational study as if it were an RCT. 37 This "emulated trial design" framework may be helpful in avoiding biases that can otherwise easily occur in observational studies. 38 Third, journals and their editors also have a responsibility to guard the quality of the studies that are published, both in the process of peer review and in their final decision regarding whether or not to publish the study results. To guarantee sufficient quality, journals should encourage or even oblige the use of checklists by authors and reviewers, such as ROBINS-I or RECORD-PE. 19,39 Moreover, reviewers must have sufficient expertise to critically review submitted studies for the presence of potential biases. As an aid, reviewer guidelines have recently been published on how to assess and interpret real-world evidence from observational studies. 40

| Conclusions
To conclude, the overall quality of observational studies on the effectiveness of in-hospital use of HCQ for the treatment of COVID-19 symptoms was heterogeneous. The urgency of situations such as a pandemic should never be an argument for conducting and publishing observational studies that are of low quality, more so because in such situations, results quickly find their way into daily practice, and the results of biased studies can have potentially harmful consequences for patients. However, the results of this study should also not be seen as a plea against observational research; instead, particularly in times of high unmet medical needs, observational studies may provide timely evidence that could be valuable, provided that the study that provides the evidence is of sufficiently high quality.

RHHG was funded by the Netherlands Organization for Scientific Research (ZonMW-Vidi project 917.16.430) and an LUMC fellowship.