A re‐analysis of 150 women's health trials to investigate how the Bayesian approach may offer a solution to the misinterpretation of statistical findings

To investigate whether a Bayesian interpretation might help prevent misinterpretation of statistical findings and support authors to differentiate evidence of no effect from statistical uncertainty.

example in the case of a binary outcome, includes reporting the proportions with the outcome of interest in each arm, a corresponding absolute or relative difference and its 95% confidence interval.6 Reporting guidelines, such as the CONSORT statement, include confidence intervals as a minimum reporting requirement.7 Unfortunately, many researchers ultimately interpret the primary and other key outcomes based on whether the confidence interval includes the null, and are thus implicitly reverting to interpretation based on statistical significance.8,9 The following two case studies illustrate the problem. The INFANT trial (n = 46 614) evaluated the use of computerised interpretation of cardiotocographs on the occurrence of adverse neonatal outcomes.10 The primary outcome occurred in 171/23 351 (0.73%) in the treatment arm versus 172/23 263 (0.74%) in the control arm. The difference was not statistically significant, with a risk ratio of 1.01, 95% confidence interval (CI) 0.82-1.25. In the conclusion, this result was interpreted as 'continuous electronic fetal monitoring in labour does not improve clinical outcomes'. This might be construed as misinterpretation of statistical significance.6
However, inspection of the finding on the risk difference scale (percentage point risk difference = −0.00, 95% CI −0.16 to 0.15) reveals that any difference in outcomes is almost certainly smaller than half a percentage point (upper bound of 95% confidence interval 0.15 percentage points) in adverse neonatal outcomes. Yet, implicit in this interpretation is that this small reduction is not clinically important. Thus, by not reporting effect sizes on a clinically interpretable scale and not explicitly interpreting the range of effect sizes supported by the confidence interval, the mechanism by which this conclusion was reached is not transparent. This is problematic, because it lends itself to a perpetuation of misinterpretation in other smaller trials, as well as making assumptions about sizes of effects that are clinically important without making this explicit.
To illustrate how non-statistically significant results are often misinterpreted, we consider a second case study. This trial compared titrated-dose oral misoprostol (intervention) with static-dose oral misoprostol (control) to increase the likelihood of a vaginal birth. The risk ratio was 0.98 (95% CI 0.77-1.24) based on 47/73 events (64%) in the treatment arm and 48/73 (66%) events in the control arm.11 Similar to the first case study, this primary outcome result was interpreted as evidence of 'similarity'. Yet, in this trial the difference in percentage points was −1.36 (95% CI −16.83 to 14.09). This confidence interval indicates considerable uncertainty, providing evidence that this intervention might either increase or reduce this outcome by a considerable amount. Thus, in this example, the interpretation of the primary outcome as showing evidence of 'similarity' is highly misleading; a more accurate interpretation is that unfortunately the study is too small to tell us anything conclusive.
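The percentage point differences quoted for the two case studies can be approximated from the reported event counts. The following is a minimal sketch, assuming a simple Wald (normal approximation) interval for the difference between two proportions; the function name `risk_difference_ci` is hypothetical, and the original trial reports may have used adjusted or otherwise different methods.

```python
import math


def risk_difference_ci(events_t, n_t, events_c, n_c, z=1.96):
    """Risk difference (treatment minus control) in percentage points,
    with a Wald 95% confidence interval (an assumed method)."""
    p_t = events_t / n_t
    p_c = events_c / n_c
    diff = p_t - p_c
    # Standard error of the difference between two independent proportions
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    return tuple(100 * v for v in (diff, diff - z * se, diff + z * se))


# Event counts as reported in the two case studies
infant = risk_difference_ci(171, 23351, 172, 23263)
misoprostol = risk_difference_ci(47, 73, 48, 73)

print("INFANT: %.2f pp (%.2f to %.2f)" % infant)
print("Misoprostol: %.2f pp (%.2f to %.2f)" % misoprostol)
```

Under these assumptions the intervals land close to those quoted above: very narrow for the INFANT trial and roughly 17 percentage points either side of zero for the misoprostol trial, which is the contrast the two case studies are intended to show.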
These case studies illustrate how non-statistically significant outcomes can be misinterpreted as evidence of no effect; and, moreover, even when results are sufficiently precise to rule out clinically important effects, trialists still persist in interpreting key outcomes based on statistical significance. 12-14

| OBJECTIVES
To illustrate how a Bayesian approach might help mitigate some of the problems around the misinterpretation of statistical findings, we undertook a Bayesian re-analysis of a contemporary sample of women's health randomised trials with binary primary outcomes. We first illustrate how the Bayesian and frequentist analyses show strong concordance. We then formulate a mechanism for how a Bayesian interpretation can be implemented by introducing the concept of the strength of statistical evidence and clinically important effect sizes. We illustrate the approach for an example set of large, moderate, small and trivial effect sizes (as well as unanticipated harm), and varying degrees of strength of statistical evidence. We contrast the interpretation of the Bayesian analysis with that from a frequentist interpretation.

| Search strategy
We identified individually randomised, two-arm superiority trials (1:1 randomisation ratio) with a binary primary outcome, whose primary report of findings was published in one of seven English language high-impact general medical and specialty journals, between January 2015 and December 2020: New England Journal of Medicine, Lancet, JAMA (the Journal of the American Medical Association), BMJ, BJOG (British Journal of Obstetrics and Gynaecology), American Journal of Obstetrics & Gynaecology and Obstetrics & Gynaecology. We included trials evaluating pharmacological and non-pharmacological interventions targeted at women to improve fertility, maternal or fetal, or perinatal outcomes. We made no restrictions on the type of comparator or setting, but excluded non-inferiority and equivalence trials. We made a post-hoc decision to exclude any trials with zero events (or 100% with the event) in either of the study arms, and studies where the primary outcome was unclear. The searches were conducted in EMBASE and MEDLINE on the Ovid platform, restricting the journal name (to one of the seven included journals) and limiting the search to randomised controlled trials published between 2015 and 2020. The list of identified studies was imported into Covidence. An initial title and abstract screen was performed, followed by a full text screen. All screening was conducted independently and in duplicate (PM and RL), with discrepancies resolved by discussion or, where needed, arbitration by a third author (KH or MT). The protocol for the review is registered on PROSPERO (PROSPERO 2021 CRD42021236171).

| Data extraction
Where available, we extracted absolute event numbers (i.e. numerators and denominators in each arm) for the primary analysis of the primary outcome; where authors only reported denominators and percentages, these were extracted instead. We also extracted the journal; intervention type, classified as pharmacological, procedural (e.g. a surgical technique or type of dressing), non-pharmacological (e.g. psychotherapy, or lifestyle changes), diagnostic or a mixture; and the primary outcome type (classified as adverse fetal outcome, adverse maternal outcome, live birth or other). We also classified each trial as to whether higher or lower event rates were desirable (e.g. reduction of adverse fetal outcomes or increased detection of adverse fetal outcome). Two authors (PM and RL) independently extracted data in duplicate and resolved any discrepancies by discussion.

| Data analysis
We used the extracted or derived number of events and total sample size for each arm to create individual level data for each trial. The contrast of interest is that of the absolute or relative difference between the proportion with the outcome in the treatment arm versus control arm. For trials in which lower event rates are desirable, negative values suggest benefit of the intervention; for trials in which higher event rates are desirable, we reversed the calculation. Thus, all absolute differences less than 0 (or relative differences less than 1) were indicative of treatment benefit.
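The data preparation and sign convention described above can be sketched as follows. This is an illustrative outline only: the helper names `expand_arm` and `oriented_contrasts` are hypothetical, and "reversing the calculation" is assumed to mean swapping the roles of the two arms (control versus intervention) when higher event rates are desirable.

```python
def expand_arm(events, n):
    """Recreate individual-level binary outcomes for one arm from its counts."""
    return [1] * events + [0] * (n - events)


def oriented_contrasts(events_t, n_t, events_c, n_c, lower_is_better):
    """Return (risk difference, risk ratio) oriented so that a difference
    below 0 (or a ratio below 1) indicates benefit of the intervention."""
    p_t = events_t / n_t
    p_c = events_c / n_c
    if lower_is_better:
        # Intervention versus control: fewer events in the treatment arm is benefit
        return p_t - p_c, p_t / p_c
    # Higher event rates desirable: reverse the contrast (control versus intervention)
    return p_c - p_t, p_c / p_t


# Hypothetical counts: 30/100 live births with intervention versus 20/100 with control
diff, ratio = oriented_contrasts(30, 100, 20, 100, lower_is_better=False)
print(round(100 * diff, 1), round(ratio, 2))
```

With this orientation a single set of cut-points can be applied to every trial, regardless of whether the trial aimed to reduce or to increase the primary outcome.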
For each trial these data were then used to estimate the risk ratio and risk difference, 95% confidence intervals (CI) and P-values under a frequentist approach. This analysis was implemented in STATA 17 using the cc function (STATA 17).15 Any cases of non-convergence were noted. For the Bayesian analysis we estimated risk ratios and risk differences using binomial regression with a log link and binomial regression with an identity link, respectively. We used a vague prior throughout (normal distribution with mean zero and standard deviation 10 000) to model the risk difference or log risk ratio. We report point estimates and associated 95% credible intervals (CrI). This analysis was implemented in STATA 17 using the bayes function with default options (Metropolis-Hastings algorithm using 12 500 iterations removing the first 2500 burn-in iterations, no thinning and starting points based on iterative reweighted least-squares estimates). Again, any cases of non-convergence were noted.
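As an illustration of the frequentist quantities involved, the sketch below computes a risk ratio with a 95% confidence interval on the log scale and a two-sided P-value from a normal approximation. It is a textbook approximation written for this purpose, not a reproduction of the STATA routine used in the analysis; the function name `risk_ratio_ci` is hypothetical.

```python
import math


def risk_ratio_ci(events_t, n_t, events_c, n_c, z=1.96):
    """Risk ratio with a log-scale Wald 95% CI and a two-sided P-value
    (normal approximation; assumes no arm has zero events)."""
    rr = (events_t / n_t) / (events_c / n_c)
    # Standard error of log(RR) for two independent binomial samples
    se_log = math.sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    z_stat = math.log(rr) / se_log
    # Two-sided P-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z_stat) / math.sqrt(2))
    return rr, lower, upper, p_value


# Second case study: 47/73 events (treatment) versus 48/73 events (control)
rr, lower, upper, p_value = risk_ratio_ci(47, 73, 48, 73)
print("RR %.2f (95%% CI %.2f to %.2f), P = %.2f" % (rr, lower, upper, p_value))
```

For the second case study this approximation gives an interval close to the reported 0.77-1.24, which suggests that the unadjusted counts are sufficient to reproduce the published frequentist result in that example.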
We then determined, based on the Bayesian model, the posterior probabilities of a large, moderate and small beneficial effect, and evidence of at most a trivial effect. For illustration only, we defined a large beneficial effect to be a risk difference greater than (−) 4 percentage points (pp); a moderate beneficial effect as a risk difference greater than (−) 1 percentage point; a small beneficial effect as a risk difference greater than (−) 0.5 percentage points; and a trivial effect to be within 0.5 percentage points difference (either way) from the null (Figure 1). We defined an unanticipated harmful effect as a difference of at least 0.5 pp in the unanticipated direction (i.e. harm). In addition, we calculated the posterior probability of a risk difference in the beneficial direction, that is, less than 0 percentage points ('any beneficial effect'). To estimate these posterior probabilities, we used the bayestest interval command, which uses the simulated posterior distribution of model parameters estimated using the bayes command.
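The posterior probabilities defined above can be approximated by simulation. The sketch below is not the STATA analysis described in this section: as a simplifying assumption it places an independent uniform Beta(1, 1) prior on the event probability in each arm (rather than a vague normal prior on the regression coefficient) and samples the risk difference by Monte Carlo. The function name `posterior_probabilities` and its arguments are hypothetical.

```python
import random


def posterior_probabilities(events_t, n_t, events_c, n_c,
                            lower_is_better=True, draws=20000, seed=1):
    """Monte Carlo posterior probabilities of each effect-size band,
    assuming independent Beta(1, 1) priors on the two arm-level risks."""
    rng = random.Random(seed)
    tally = {"large": 0, "moderate": 0, "small": 0,
             "trivial": 0, "harm": 0, "any_benefit": 0}
    for _ in range(draws):
        # Conjugate Beta posterior for the event probability in each arm
        p_t = rng.betavariate(1 + events_t, 1 + n_t - events_t)
        p_c = rng.betavariate(1 + events_c, 1 + n_c - events_c)
        rd = 100 * (p_t - p_c)          # risk difference in percentage points
        if not lower_is_better:
            rd = -rd                    # orient so that negative means benefit
        tally["large"] += rd < -4       # benefit greater than 4 pp
        tally["moderate"] += rd < -1    # benefit greater than 1 pp
        tally["small"] += rd < -0.5     # benefit greater than 0.5 pp
        tally["trivial"] += abs(rd) < 0.5
        tally["harm"] += rd >= 0.5      # at least 0.5 pp in the harmful direction
        tally["any_benefit"] += rd < 0
    return {band: count / draws for band, count in tally.items()}


# The two case studies from the introduction
infant = posterior_probabilities(171, 23351, 172, 23263)
misoprostol = posterior_probabilities(47, 73, 48, 73, lower_is_better=False)
print(infant)
print(misoprostol)
```

The bands for large, moderate and small benefit are nested rather than mutually exclusive, so their probabilities do not sum to one. Because the prior here differs from the one used in the analysis, the numerical results should be read as indicative only.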
FIGURE 1 Proposed classification of large, small, moderate and trivial effect sizes.
These cut-points for large, moderate and small effects are used for illustration only, and we suggest in practice these be grounded by effect sizes of clinical importance in the particular trial context. Working on scales that are known to be more interpretable can help to this end; consideration of effect sizes of other common interventions might also help. For example, the values we have chosen are equivalent to numbers needed to treat (NNT) of 25, 100 and 200, respectively. The use of aspirin for stroke prevention has an NNT in the region of 300 over 10 years;16 the use of aspirin after stroke has an NNT in the region of 150 over 3 years to prevent a non-fatal heart attack;17 whereas the use of dexamethasone in COVID-19 has an NNT in the region of 40 (RECOVERY, 2021).18 Thus, although our choice is to some extent arbitrary and not context-specific, these values are unlikely to be very dissimilar to those chosen in practice. By evaluating 'trivial effects' we implicitly consider evidence of no benefit.
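The correspondence between the illustrative cut-points and the numbers needed to treat follows from the NNT being the reciprocal of the absolute risk difference. A minimal sketch, in which the function name `nnt_from_pp` is hypothetical:

```python
def nnt_from_pp(risk_difference_pp):
    """Number needed to treat for an absolute risk difference
    expressed in percentage points."""
    if risk_difference_pp == 0:
        raise ValueError("NNT is undefined for a risk difference of zero")
    return 100 / abs(risk_difference_pp)


# Cut-points for large, moderate and small beneficial effects
for label, pp in [("large", -4), ("moderate", -1), ("small", -0.5)]:
    print(label, nnt_from_pp(pp))
```

The same conversion can be used in the other direction when clinicians find it easier to state a clinically important effect as an NNT than as a percentage point difference.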
We then quantified the strength of the statistical evidence of this range of effect sizes. We suggest posterior probabilities of at least 95% might be considered as strong statistical evidence, posterior probabilities of at least 90% but below 95% as moderate statistical evidence, and anything below 90% as weak statistical evidence. In a sensitivity analysis we set 97.5% as the cut-point for strong statistical evidence, 95% for moderate statistical evidence and anything <95% for weak statistical evidence. Conventionally posterior probabilities are reported without any such categorisation,19-22 although others have also proposed categorising, for example using >80%, 90% or 95% posterior probabilities as convincing evidence.23,24 Finally, we classified the overall statistical evidence as strong if there was strong statistical evidence of either at least a small effect (which includes moderate and larger effects), a trivial effect or an unanticipated harmful effect.
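The classification rule can be written compactly. The sketch below treats the thresholds as inclusive (posterior probability ≥95% for strong evidence, as reported in the Results); the function names and the dictionary keys are hypothetical and chosen for illustration.

```python
def strength_of_evidence(probability, strong=0.95, moderate=0.90):
    """Label a posterior probability as strong, moderate or weak evidence."""
    if probability >= strong:
        return "strong"
    if probability >= moderate:
        return "moderate"
    return "weak"


def overall_evidence_is_strong(probabilities, strong=0.95):
    """Overall evidence is strong if there is strong evidence of at least a
    small effect, of a trivial effect, or of an unanticipated harmful effect."""
    return any(probabilities[band] >= strong
               for band in ("small", "trivial", "harm"))


# Hypothetical posterior probabilities for a single trial
example = {"small": 0.30, "trivial": 0.99, "harm": 0.00}
print(strength_of_evidence(example["trivial"]))
print(overall_evidence_is_strong(example))

# Sensitivity analysis: stricter cut-points of 97.5% and 95%
print(strength_of_evidence(0.96, strong=0.975, moderate=0.95))
```

Keeping the thresholds as arguments makes the sensitivity analysis a change of two parameters rather than a separate rule.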

| Characteristics of included trials
The search was performed on 4 March 2021 (Figure 2); the characteristics of the 150 trials, published between 2015 and 2020, and assessed to be eligible are summarised in Table 1. The studies were roughly evenly distributed across the seven journals, albeit with proportionately fewer published in both JAMA (12, 8%) and the BMJ (9, 6%). Most were testing a pharmacological intervention (59, 39%), a procedural intervention (48, 32%) or a non-pharmacological intervention (27, 18%). The most common outcome types were adverse fetal (27, 18%) and adverse maternal outcomes (46, 31%). The average prevalence of the outcome (in the control arm) was 22% (interquartile range [IQR] 10-41%). The majority (98, 65%) of the trials were trying to reduce the primary outcome (e.g. reduction in adverse fetal outcome) and, in a smaller number (52, 35%), the objective was to increase the primary outcome (e.g. increase the live birth rate). For those 52 trials with an objective to increase the primary outcome, the comparisons that follow relate to control versus intervention rather than intervention versus control. The median number of participants randomised (total across both arms) was 503 (IQR 238-1092).

| Frequentist and Bayesian results
Of the 150 trials, approximately a third (48, 32%) were statistically significant according to our frequentist reanalysis (Table 2). Across all 150 trials under the Bayesian reanalysis, the average percentage point difference was −1.73 (IQR −7.18 to 0.77) and the average risk ratio was 0.92 (IQR 0.73-1.07) (Table 2). When estimating the risk ratio and risk difference, the occurrence of non-convergence was low (4% and 0%, respectively). The average posterior probability of any beneficial effect (risk difference ≤0) was 0.79 (IQR 0.40-0.99). The frequentist and Bayesian approaches led to similar inferences, as indicated by the similarity of the point estimates and confidence intervals/credible intervals, and this was the case for both absolute and relative measures of effect (Figure S1, Table 2). Thus, there is a strong one-to-one alignment between the two analytical approaches.

| Classification of strength of statistical evidence
Among the 102 non-statistically significant trials, eight (8%) had strong statistical evidence whereas 94 (92%) of the studies yielded moderate or weak (posterior probability <95%) statistical evidence (Table 3).Of the eight trials classified as having strong statistical evidence, three (3%) had strong statistical evidence (posterior probability ≥95%) of at least a small benefit (NNT <200).None of these had strong statistical evidence of large benefit (NNT <25) or moderate benefit (NNT <100).A further two (2%) were classified as having strong statistical evidence of a trivial effect (percentage point difference within 0.5 of 0) and three (3%) were classified as having strong statistical evidence (posterior probability ≥95%) of harm (percentage point risk difference ≥0.5 pp).
Of those trials where the primary outcome was statistically significant, 47 (98%) were classified as having strong statistical evidence (Table 3).Moreover, many (40, 83%) were classified as having strong statistical evidence of at least a moderate effect, and many (24, 50%) were classified as having strong statistical evidence of a large effect.In addition, five (10%) of statistically significant trials were classified as having strong statistical evidence of an unanticipated harmful effect, although these would also have been interpreted as evidence of harm under the frequentist approach.
Over all 150 trials there was strong statistical evidence (posterior probability ≥95%) in around one-third of the studies (55, 37%). When we increased the stringency of the statistical evidence so that strong statistical evidence was classified as posterior probabilities >97.5%, the certainty of all conclusions decreased: the proportion of trials classified as having strong statistical evidence decreased from 37% to 29% (Table S1, Figure 3). In only two of the trials was there strong statistical evidence of a trivial effect. Of the 102 non-statistically significant trials, in only two (2%) was there strong statistical evidence (both for a trivial effect).

| Summary of findings
A Bayesian approach to interpretation might help distinguish those studies for which there is evidence the intervention does not work, from those for which the studies were probably too small and the resulting findings inconclusive. We first provide reassurance that the two analytical techniques have a strong one-to-one correspondence. Secondly, we illustrate how a Bayesian interpretation of the 102 statistically non-significant trials in this sample can differentiate those for which there is statistical evidence of no effect (a small minority) from those for which there is considerable statistical uncertainty (the majority).
The approach requires the specification of effect sizes that are clinically important. Although this is not expected to be an easy task, the Bayesian approach is transparent about this, whereas the frequentist approach mostly ignores that this is a necessary condition for interpreting confidence intervals.25 On the other hand, the frequentist strict interpretation of P-values ensures that the floodgates do not open for declaring any intervention as effective, whereas the Bayesian approach proposed here might be viewed as opening the gate a little more to prevent it being shut on interventions that might well be effective.

| Detecting uncertainty
Although the 95% confidence intervals and 95% credible intervals were highly consistent across the two approaches, the Bayesian approach to interpretation allowed identification of trials with wide confidence intervals which supported both benefit and harm, and were thus inconclusive. This applied to around two-thirds of the studies in this sample. Our results underscore that many studies are under-powered to detect small but still meaningful effects (the average total sample size was in the region of 500). Consider again the second case study, the comparison of titrated-dose oral misoprostol (intervention) with static-dose oral misoprostol (control), in which the reported risk ratio for the event of vaginal delivery was 0.98 (95% CI 0.77-1.24) based on 47/73 events in the treatment arm and 48/73 events in the control arm.11 Here the strength of statistical evidence is <60% for all effect sizes; thus, the Bayesian interpretation is that the findings of this study are uncertain.

| Detecting small effects
In a handful of studies, we were able to identify evidence of a trivial impact (that is, an effect size so small as to almost certainly not be of clinical importance), which we defined as a number needed to treat >200, but which in practice can be smaller or larger depending on the nature of the specific setting, intervention and outcome. The INFANT trial, which included nearly 50 000 participants, with the primary outcome occurring in 0.7% and a result that was not statistically significant (adjusted risk ratio 1.01, 95% CI 0.82-1.25), was a candidate study for being able to demonstrate no impact.10 For this study the posterior probability of a trivial effect (number needed to treat >200) was 100%. Although trials indeed need to be very large to rule out small effects definitively, this example nicely illustrates how the Bayesian approach can help with a definitive interpretation of a non-statistically significant outcome.

| Unanticipated harmful effects
Although our focus was on posterior probabilities of beneficial effects, it is possible that an intervention which is hypothesised to bring about benefit can actually have a harmful effect. We do not necessarily suggest that evaluating posterior probabilities of harmful effects should be routine, as posterior probabilities of benefit in such settings would be low. Nonetheless, we did identify that a minority of trials had strong statistical evidence of effects in the unanticipated direction. For example, in one trial, delaying infertility treatment after a 6-month lifestyle-intervention programme in obese women statistically significantly reduced, rather than increased, the proportion of women having a vaginal birth within 24 months.26 In practice we suggest that if trialists did observe a potential harmful effect, it could be useful to examine the probability of large, moderate or small harmful effects.

| Statistical versus scientific evidence
Our classification of the strength of statistical evidence was concerned with the inference based on the primary outcome. In practice, researchers must consider much wider influences, for example the scientific rigour of the trial, the context, costs and potential harms of the treatment.4,5 We have not considered these factors but have instead tried to provide researchers with effective tools to interpret key outcomes properly. Only after key outcomes have been interpreted can investigators properly consider the wider implications of whether the intervention should be used. Thus, although we have illustrated this technique on a sample of real trials, we do not attempt to make inferences about specific interventions, and for these reasons we have not undertaken a risk of bias assessment and do not recommend our results be used to inform treatment decisions.

| Retaining reproducibility
We used an arbitrary classification for the strength of the statistical evidence. When we increased the stringency of the statistical evidence by classifying posterior probabilities >97.5% as strong statistical evidence, the number of studies for which it was possible to conclude something definitive decreased. Lowering thresholds for strength of statistical evidence might lead to increases in non-reproducible results. Relatedly, a similar approach could be undertaken using P cut-points. However, the frequentist approach is tightly woven within a paradigm that strongly controls type-1 error (claiming there is an effect when it is a chance finding) and, as a consequence, opens the floodgates for type-2 errors (claiming there is no effect where one exists). Thus, whatever approach is adopted, care must be taken to ensure both types of errors are controlled. There are of course other ways to control type-1 errors, such as prespecification of primary outcomes and anticipated effects, as well as showing reproducibility in other settings. As with any classification system, the pros and cons of misclassification depend on context.27 For example, very stringent evidence might be required before the acceptance of an invasive surgical procedure, but perhaps less convincing evidence might be acceptable before recommending a low-cost, low-harm, non-invasive therapy.4,5

| Classification of size of effects
We have used somewhat arbitrary classifications for clinically important and trivial effect sizes.28 We thus suggest that with appropriate contextual knowledge, clinically important effect sizes should be defined at the planning stage.23 Creating an explicit necessity to specify clinically important effect sizes up front should prompt decision makers to think about this important question at the planning stage rather than the interpretation stage. Although we only consider binary outcomes, the methods proposed can readily be extended to continuous outcomes, where the concepts of clinically important differences are often better established.29

| Accessibility and implementation
Frequentist inference is by far the predominant method of inference (Gupta 2012).13,30,31 Unlike the frequentist approach, a Bayesian analysis requires specification of prior distributions and this might be a perceived barrier to its use.22 In this application we used standard non-informative (vague) priors, illustrating how the approach can be used without dependency on 'priors', which might induce concerns of lack of reproducibility.32 The finding that the Bayesian and frequentist point estimates, confidence/credible intervals and P-values/posterior probabilities showed strong concordance gives confidence that inferences are not dependent on the chosen prior.19-22,33 Furthermore, the Bayesian approach is pitched here as an aid to interpretation and not as a technique that will radically change the numerical results; thus it might even have a place alongside a conventional frequentist analysis. However, the approach might also be used in conjunction with an informative prior, fully embracing the Bayesian philosophy, and this might be particularly important in rare diseases or interventions in difficult-to-recruit populations.

| Generalisability
Our review was limited to trials in the area of women's health, but the proposal and its implications should be generalisable to other clinical areas with binary outcomes, albeit perhaps with some reconsideration of what constitutes clinically important effect sizes. In addition, our review was limited to trials in high impact journals, which might suggest that the true proportions of trials with statistically significant findings (~one-third) or with strong statistical evidence (~one-third) in the wider medical literature might be lower than in our review. As others have suggested, when considered from a perspective of clinically important effects, there is no real difference between superiority, non-inferiority and equivalence trials.25 We thus suggest this approach could be used for the interpretation of non-inferiority as well as superiority trials.34

| CONCLUSION
The key findings of most randomised trials are interpreted on the basis of statistical significance, leading to many interventions being declared as ineffective when the findings are statistically uncertain (type-2 error). This is a well-known problem. In part, this problem of misinterpretation arises because a strict frequentist interpretation of statistical significance prioritises not misclassifying treatments as effective when they are not (type-1 error). In so doing, this perpetuates the problem of treatments being declared as ineffective when they are actually uncertain. A Bayesian interpretation of findings, alongside reporting of confidence intervals and effect sizes, may help strike a balance between minimising both types of errors.

DATA AVAILABILITY STATEMENT
Data are available on request.

ETHICS APPROVAL
No ethical approval was obtained for this study; it is a review and therefore no ethical review was needed.

TABLE 3 Classification of trials based on strength of statistical evidence of important beneficial effect sizes. Overall strength of statistical evidence a

FIGURE 3 Classification of trials into clinically important effect sizes: at (A) 95% and (B) 97.5% for strong statistical evidence.

KH, MT and AC led the development of the idea. KH analysed the data and wrote the first draft of the article. PM and RL undertook the search and data abstraction. PM led the development of the associated protocol for the review. All authors made an intellectual contribution to the development of the ideas and commented on draft versions of the paper.

ACKNOWLEDGEMENTS
None.

FUNDING INFORMATION
This research was partly funded by the UK NIHR Collaborations for Leadership in Applied Health Research and Care West Midlands initiative. Karla Hemming is funded by an NIHR Senior Research Fellowship SRF-2017-10-002. This research is independent of the funder.

Examination of consistency of inferences between Bayesian and frequentist approaches.
TABLE 1 Characteristics of included studies. IQR, interquartile range. a Average prevalence in the control arm.
TABLE 2 a Subsequent summaries are presented over results that converged. *P ≤ 0.05.
a Overall statistical evidence classified as strong if strong statistical evidence of either at least a small effect or a trivial effect or an unanticipated harmful effect. Italics are non-mutually exclusive categories.