What do we know about tocolytic effectiveness and how do we use this information in guidelines? A comparison of evidence grading

Authors


Abstract

Background

Evidence summaries of tocolytic effectiveness assign quality levels based on a single dimension: the study design. The Grading of Recommendations Assessment, Development and Evaluation (GRADE) system takes into account several domains, including limitations of the study design and ranking the importance of outcomes.

Objectives

The aim of the study was to compare the quality of evidence according to GRADE with the quality as described by existing guidelines.

Search strategy

A practitioner survey to rank the importance of outcomes and a systematic review were conducted. For the systematic review, we searched Medline, Embase, and DARE databases from inception to December 2010 using the terms ‘tocolytics’ and ‘threatened preterm labour’, without any language restrictions.

Selection criteria

Inclusion criteria for the review were randomised controlled trials comparing tocolytics with either placebo or betamimetics.

Data collection and analysis

The review and survey teams worked independently. Evidence ratings according to GRADE were performed.

Main results

The majority of the survey respondents thought that it was important to use tocolytics to buy the time needed for steroids to promote fetal lung maturation and to allow in utero transfer. Nearly 80% of ‘high’ ratings in guidelines were downgraded as a result of deficiencies identified by GRADE.

Authors’ conclusions

We propose a move away from the use of evidence rating systems reliant solely on study design, as they have a propensity towards strong recommendations when the underlying evidence is weak.

Introduction

Preterm birth is a key factor in infant survival and quality of life, so treating threatened preterm labour is important.[1] The only treatment convincingly shown to improve the perinatal outcome of preterm labour is the administration of corticosteroids to the mother before she gives birth.[2] Tocolysis is associated with a prolongation of pregnancy, which can facilitate the delivery of steroids, but on its own it has no clear effect on perinatal or neonatal morbidity.[3] Tocolysis is also an area in which there are many therapies and numerous outcomes. The controversy surrounding tocolysis in part stems from a lack of transparent evidence summaries in the guidelines that shed light on the safety and effectiveness of different tocolytic agents, taking into account the importance of each outcome separately. To our knowledge, no formal studies have been undertaken to rank outcomes for their importance.

The Grading of Recommendations Assessment, Development and Evaluation (GRADE) system takes into account several domains, including limitations of study design, inconsistency, indirectness, imprecision, and other considerations, when evaluating the quality of randomised controlled trials (RCTs). It explicitly considers ranking the importance of outcomes.[4] This approach is different from systems used in guidelines, which tend to focus only on the limitations of study design. In guidelines, even inconsistent, indirect and imprecise data can be ranked highly just because the study design is randomised. The problem with this classification of evidence is that it does not distinguish between high- and low-quality RCTs. All RCTs are automatically given a grade A for quality of evidence, but it should be possible to rank some poor-quality RCTs at the same level as an observational study. The World Health Organization states in its Handbook for Guideline Development that evidence summaries should be created using GRADE.[5]

This review demonstrates how to rate data on tocolytic effectiveness collated using a systematic review, according to the GRADE methodology, formally ranking outcomes in preterm labour in terms of their importance, and exploring how existing guidelines could be affected if GRADE were employed.

Methods

In order to compare the quality of evidence according to GRADE with the quality of evidence described in recent guidelines, we carried out both a practitioner survey to rank the importance of outcomes and a systematic review. The review and survey teams worked independently. Evidence profiles were produced according to the GRADE methodology and applied to existing guideline statements.

Practitioner survey

The survey, carried out using a specifically designed questionnaire, sought responses from obstetricians on the importance of several outcomes: perinatal morbidity; safety for the mother; perinatal mortality; avoiding birth before 34 weeks of gestation; avoiding birth within 24 and 48 hours of initiating tocolysis, to allow for the administration of corticosteroids and in utero transfer; avoiding birth before 37 weeks of gestation; and other outcomes in preterm labour and birth. A scale anchored between critical at one extreme and not at all important at the other was used to determine which outcomes, whether measured in trials or not, were critical or important. This survey helped us to focus on the outcomes ranked clinically important by at least 50% of the respondents. Other questions were ‘Is it reasonable to use tocolytics to allow steroids to work?’ and ‘Is it reasonable to use tocolytics to allow in utero transfer?’ The survey instrument was developed and piloted among ten clinicians. The items in the questionnaire were revised in light of the feedback received before conducting the reported survey. Because one of the questions was ambiguous and potentially misleading, the results for this question are not reported in this paper. The survey targeted members of OBGYN.net and attendees at The 21st Century Obstetrics Problems meeting (Clinical Management, 17–18 September 2010, Birmingham, UK). It was administered on paper as well as online and via email. The responses were collated in an Excel database for analysis. Non-responders were not given reminders.

Systematic review

We searched the Medline, Embase, and DARE databases from inception to December 2010 using the terms ‘tocolytics’ and ‘threatened preterm labour’, without any language restrictions. For inclusion in the review, we required studies to be RCTs comparing tocolytics with either placebo or betamimetics, with betamimetics being the agents that have been in use for the longest time to date. Included in the study population were women with threatened preterm labour. Two reviewers independently evaluated the eligibility of the trials for inclusion. The reviewers were experienced in the assessment of study quality and had clinical experience. They independently extracted data covering study characteristics, methodological quality, and results using standardised data extraction. Any discrepancies were discussed, and if no agreement could be reached a third reviewer made the final decision. Consensus was achieved using aggregated summaries. Pooled relative risks (RRs) and corresponding 95% confidence intervals (95% CIs) of outcomes were calculated for comparable studies in which similar outcomes were assessed.
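The pooling step can be illustrated with a short calculation. The review does not state which pooling model was used, so the following is only a minimal sketch that assumes fixed-effect inverse-variance pooling on the log scale. The function names and the trial counts are hypothetical and are not taken from the included studies.

```python
import math

def log_rr_and_se(events_trt, n_trt, events_ctl, n_ctl):
    """Log relative risk and its standard error for one two-arm trial."""
    rr = (events_trt / n_trt) / (events_ctl / n_ctl)
    se = math.sqrt(1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl)
    return math.log(rr), se

def pooled_rr(trials, z=1.96):
    """Fixed-effect inverse-variance pooled RR with a 95% CI (assumed model)."""
    weights, weighted_logs = [], []
    for trial in trials:
        log_rr, se = log_rr_and_se(*trial)
        weight = 1 / se ** 2  # more precise trials receive more weight
        weights.append(weight)
        weighted_logs.append(weight * log_rr)
    pooled_log = sum(weighted_logs) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return (math.exp(pooled_log),
            math.exp(pooled_log - z * pooled_se),
            math.exp(pooled_log + z * pooled_se))

# Hypothetical counts: (events_treatment, n_treatment, events_control, n_control)
trials = [(10, 100, 20, 100), (15, 150, 30, 150)]
rr, lo, hi = pooled_rr(trials)
print(f"Pooled RR {rr:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

Both hypothetical trials have the same RR, so the pooled estimate equals that value and only the width of the interval reflects the combined sample size. A random-effects model would be a reasonable alternative where heterogeneity is suspected.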

GRADE methodology

For each comparison and outcome pair, evidence quality was assessed according to GRADE and included five domains: study design, risk of bias, inconsistency, indirectness, and imprecision.

Because all evidence in this review was based on RCTs, the first domain, study design, was assigned a ‘high’ level of quality.

Risk of bias (also referred to as limitations in the study design) was present in cases with lack of allocation concealment, lack of blinding, loss to follow-up, failure to adhere to the intention-to-treat principle when indicated, and other limitations in outcome assessment (for example, stopping early for the benefit observed in RCTs, especially in the absence of stopping rules).

The third domain, inconsistency, is also described as heterogeneity of results without a plausible explanation, based on differences in populations, interventions, or outcomes. When group differences in the inclusion criteria appeared (e.g. in terms of multiple pregnancy, gestational age or ruptured membranes), or alternative tocolytics were used, the quality rating was downgraded. Conflicting results were downgraded to the lowest level.

Indirectness could refer to indirect comparison. As we only included trials with direct comparison between tocolytics and placebo or tocolytics and betamimetics, this is not applicable in this review. Indirectness could also refer to differences between the research question and the evidence available, with respect to the following components: population, intervention, comparator, and outcomes. For example, if the tocolytic dosage was much higher or lower than that approved, registered, or recommended by an agency or official body, the quality of evidence was downgraded. Also, when the effectiveness of the initial tocolysis could not be determined independently because of the use of maintenance tocolysis, the quality was downgraded.

The fifth domain, imprecision of results, referred to wide 95% CIs as a consequence of few participants or few events. The quality of evidence was downgraded in the case of a nonsignificant result or in the case of a >25% reduction or increase in RR.

Whenever there was a deficiency, the quality was downgraded by one level if the deficiency was classified as serious and by two levels if the deficiency was classified as very serious. These judgements were subjective by their nature. In order to diminish the subjectivity as much as possible, two reviewers classified the quality of evidence independently. Discrepancies were discussed and, if no agreement could be reached, a third reviewer made the decision. The quality of evidence was ranked as ‘high’ (i.e. further research is very unlikely to change our confidence in the estimate of effect), ‘moderate’ (i.e. further research is likely to have a substantial impact on our confidence in the estimate of effect, and may change the estimate), ‘low’ (i.e. further research is very likely to have a substantial impact on our confidence in the estimate of effect and is likely to change the estimate), or ‘very low’ (i.e. any estimate of an effect is very uncertain). Graphics for effective reporting of evidence quality with GRADE were generated (Appendices S1 and S2).[6]
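The downgrading rule described above can be sketched in a few lines. This is an illustration rather than the procedure implemented in GRADEpro: it assumes that penalties accumulate across domains and that the rating cannot fall below ‘very low’. The function name and the example judgements are hypothetical.

```python
LEVELS = ["very low", "low", "moderate", "high"]
PENALTY = {"none": 0, "serious": 1, "very serious": 2}

def grade_quality(deficiencies, start="high"):
    """Downgrade from the starting level by one step per 'serious' and two
    steps per 'very serious' deficiency, never going below 'very low'."""
    steps = sum(PENALTY[severity] for severity in deficiencies.values())
    index = max(LEVELS.index(start) - steps, 0)
    return LEVELS[index]

# Hypothetical judgements for one comparison and outcome pair.
# RCT evidence starts at 'high', as described for the study design domain.
example = {
    "risk of bias": "serious",
    "inconsistency": "serious",
    "indirectness": "none",
    "imprecision": "none",
}
print(grade_quality(example))  # low
```

The severity labels remain reviewer judgements, so the code only makes the arithmetic of the rating explicit and does not reduce the subjectivity of the classification itself.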

Comparison of quality of evidence according to GRADE with quality of evidence in guidelines

Guidelines on tocolytics in threatened preterm labour, from the UK,[3] the USA,[7] the Netherlands,[8] and Australia,[9] were used to compare our GRADE assessment with the recommendations and ranking of evidence in the guidelines. The level of evidence in these guidelines was assigned according to the study design, with recommendations based on RCTs graded A.

Results

Practitioner survey

There were 585 respondents: 112 from the UK, 168 from mainland Europe, and 305 from the rest of the world. There was no significant heterogeneity among these groups in the responses. The majority of the respondents thought that it was important to use tocolytics to buy the time needed for steroids to promote fetal lung maturation and to allow in utero transfer (Figure 1). Overall, perinatal morbidity and safety for the mother and neonate were endorsed as either critical or important outcomes by 95% of respondents, perinatal mortality and avoiding birth before 34 weeks of gestation were endorsed by 91% of respondents, avoiding birth within 24 and 48 hours of initiating tocolysis to allow corticosteroid administration and in utero transfer was endorsed by 85 and 82% of respondents, respectively, and avoiding birth before 37 weeks of gestation was endorsed by 55% of respondents.

Figure 1.

Distribution of responses of practitioners to the survey on the importance of outcomes to measure the effectiveness of tocolytic therapy, and their views on the use of tocolysis in threatened preterm labour.
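The selection of outcomes for the review can be reproduced from the reported percentages. The sketch below applies the 50% cut-off described in the Methods to the endorsement figures given in the text; the variable names are illustrative.

```python
# Percentage of respondents rating each outcome as critical or important,
# as reported in the survey results.
endorsement = {
    "perinatal morbidity": 95,
    "safety for the mother and neonate": 95,
    "perinatal mortality": 91,
    "avoiding birth before 34 weeks": 91,
    "avoiding birth within 24 hours": 85,
    "avoiding birth within 48 hours": 82,
    "avoiding birth before 37 weeks": 55,
}

THRESHOLD = 50  # per cent, the cut-off described in the Methods

# Rank outcomes from most to least endorsed and keep those at or above the cut-off
ranked = sorted(endorsement.items(), key=lambda item: -item[1])
retained = [name for name, pct in ranked if pct >= THRESHOLD]

for name in retained:
    print(f"{endorsement[name]:>3}%  {name}")
```

All seven reported outcomes pass the cut-off, with avoiding birth before 37 weeks of gestation the closest to the threshold.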

Synthesis of research into tocolytic effectiveness

The search identified 1185 potentially relevant citations, the titles and abstracts of which were screened (Figure 2). Reviews and studies in fields other than tocolytic effectiveness were excluded. It was possible to retrieve the full text of 194 potentially relevant papers. After assessment of the full text, 60 RCTs were finally included: eight papers on atosiban, 13 on betamimetics, 15 on calcium channel blockers, eight on indomethacin, six on nitric oxide donors (NODs), and ten on magnesium sulphate.

Figure 2.

Identification of relevant literature on tocolytic effectiveness.

Comparison of tocolytics versus placebo

Betamimetics versus placebo

When compared with placebo, betamimetics lowered the risk of delivery within 48 hours, with an RR of 0.63 (95% CI 0.53–0.79); however, the ten RCTs from which these data were obtained were assessed as having serious limitations and inconsistency according to GRADE. The RR (95% CI) of preterm birth before 37 weeks of gestation was 0.95 (0.88–1.03), with very low quality according to GRADE. Betamimetics did not reduce the risk of perinatal death significantly. The risk of treatment cessation because of side effects was 11 times higher in the betamimetics group compared with the placebo group (RR 11.38, 95% CI 5.21–24.86). These GRADE assessments also showed serious limitations, inconsistency, and indirectness. The best quality of evidence in betamimetics studies (assessed as moderate) was found for the outcomes cerebral palsy, cardiac arrhythmia, pulmonary oedema, tremor, hyperglycaemia, hypokalaemia, and fetal tachycardia. Most of these outcomes reflect maternal side effects, the frequencies of which were significantly increased in the betamimetics group.

Atosiban versus placebo

Atosiban reduced the risk of delivery within 24 and 48 hours when compared with placebo, with more women remaining undelivered after 48 hours (RR 1.20, 95% CI 1.05–1.39). According to GRADE, there were no serious concerns regarding the quality of evidence. The rate of spontaneous preterm birth before 37 weeks of gestation did not differ significantly between the groups (RR 1.17, 95% CI 0.99–1.37). Perinatal mortality and admission to a neonatal intensive care unit (NICU) did not differ between the groups. Treatment cessation because of side effects was more frequent when atosiban was used compared with placebo, with an RR of 4.02 (95% CI 2.05–7.85); however, GRADE assessment showed some serious concerns of inconsistency and indirectness in these trials.

Indomethacin versus placebo

Indomethacin reduced the risk of preterm birth before 37 weeks of gestation (RR 0.21, 95% CI 0.07–0.62; with only serious indirectness according to GRADE) and delivery within 48 hours (RR 0.19, 95% CI 0.07–0.51; no serious concerns). Maternal adverse drug reaction (with low quality), and perinatal mortality and admission to NICU (both with moderate quality) were not significantly different between the indomethacin and placebo groups.

The quality of reports of several neonatal outcomes, such as necrotising enterocolitis (NEC), chronic neonatal lung disease, and neonatal sepsis, was assessed as high. The rates of these outcomes were not significantly different between the indomethacin and placebo groups, but the number of women was low, as only two trials reported these outcomes.

Nitric oxide donors (NODs) versus placebo or no treatment

Reports on the outcomes preterm birth before 37 weeks of gestation, perinatal death, and maternal adverse drug reaction were of high quality. There was no reduction in the NOD group in the frequency of preterm birth before 37 weeks of gestation or perinatal death, but the latter had a very wide confidence interval. The frequency of maternal adverse drug reaction was significantly increased in the NOD group, with RR 1.40 (95% CI 1.06–1.86). Delivery within 48 hours and neonatal death unrelated to congenital abnormalities were not significantly different between the NOD and placebo or no-treatment groups, and the quality of these reports was moderate.

Comparison of tocolytics versus betamimetics

Calcium channel blockers versus betamimetics

The reporting quality of all outcomes in the comparison of calcium channel blockers with betamimetics was very low, so these outcomes should be interpreted with care. No beneficial effect of calcium channel blockers was seen for preterm birth before 37 weeks of gestation, preterm birth before 34 weeks of gestation, delivery within 48 hours, or perinatal mortality. Treatment with calcium channel blockers was ceased less frequently than treatment with betamimetics (RR 0.10, 95% CI 0.04–0.25). In the calcium channel blocker group, fewer neonates were admitted to the NICU (RR 0.75, 95% CI 0.61–0.92). Neonatal and fetal death and neonatal sepsis were not different between groups. The frequencies of respiratory distress syndrome (RDS), NEC, and intraventricular haemorrhage (IVH) were all lower in the calcium channel blocker group.

Calcium channel blockers versus betamimetics (with maintenance therapy)

The difference between calcium channel blockers and betamimetics (including maintenance therapy) was evaluated for several outcomes. The quality of these reports was low to very low. Fewer women in the calcium channel blocker group delivered before 37 weeks of gestation (RR 0.84, 95% CI 0.73–0.98), and before 34 weeks of gestation (RR 0.76, 95% CI 0.64–0.91). Delivery within 48 hours was comparable between groups. Calcium channel blockers were better tolerated than betamimetics, as fewer women in the calcium channel blocker group had to stop treatment because of side effects (RR 0.1, 95% CI 0.03–0.31). Neonates also benefited from calcium channel blockers, with fewer admissions to NICU, and fewer cases of RDS and neonatal jaundice. The frequencies of NEC, IVH (all grades), and severe IVH (grades 3 and 4) were comparable between groups.

Calcium channel blockers versus betamimetics (without maintenance therapy)

The quality of studies reporting on the comparison of nifedipine versus betamimetics without maintenance therapy was slightly better than that of studies reporting on the same comparison with maintenance therapy, with quality varying between moderate and low. The rates of preterm birth before 37 weeks of gestation, preterm birth before 34 weeks of gestation, and delivery within 48 hours were comparable. Fewer side effects were seen in the calcium channel blocker group (RR 0.31, 95% CI 0.20–0.46), but treatment cessation did not differ (RR 0.08, 95% CI 0.0–1.3). Apparently, for the first 48 hours, treatment was better tolerated in the calcium channel blocker group. Neonatal outcomes, such as admission to NICU, NEC, and IVH (all grades), were comparable between groups.

Atosiban versus betamimetics

The frequency of preterm birth before 37 weeks of gestation did not differ between the atosiban and betamimetics groups, and the quality of the reports providing these data was assessed as high. Perinatal outcomes were also comparable between the two groups, but the quality of reports providing these data was assessed as varying between very low and moderate. Fewer women in the atosiban group stopped treatment because of side effects (RR 0.03, 95% CI 0.01–0.09), but the study reporting this finding was of very low quality.

Magnesium sulphate versus betamimetics

Compared with betamimetics, magnesium sulphate had no beneficial effect on maternal and perinatal outcomes, and no differences in side effects were reported. The quality was assessed as varying between very low and moderate.

Nitric oxide donors (NODs) versus betamimetics

When compared with betamimetics, NODs had a beneficial effect on the risk of birth before 37 weeks of gestation (RR 0.83, 95% CI 0.7–0.97). According to GRADE, several serious concerns existed in terms of quality assessment, with limitations in the study design, inconsistency, and indirectness being found. Moreover, this beneficial effect was not seen when NODs were compared with placebo (see above), and in these comparisons no serious concerns regarding quality assessment existed. Rates of preterm birth before 34 weeks of gestation, delivery within 24 and 48 hours, perinatal death, IVH, chronic lung disease, NEC, and patent ductus arteriosus (PDA) were comparable between groups. Maternal tachycardia, shortness of breath, and chest pain/tightness were seen less frequently in the NOD group compared with the betamimetics group.

Indomethacin versus betamimetics

Compared with betamimetics, indomethacin had a beneficial effect on the risk of spontaneous delivery before 37 weeks of gestation (RR 0.53, 95% CI 0.28–0.99) and on the risk of delivery within 48 hours (RR 0.27, 95% CI 0.08–0.96). For the outcome spontaneous preterm birth, quality assessment showed no concerns, whereas in reports of delivery within 48 hours there was serious inconsistency and indirectness. Compared with betamimetics, indomethacin had a lower risk of maternal side effects (RR 0.10, 95% CI 0.05–0.21). High quality was seen for RDS, NEC, and neonatal sepsis: these parameters all showed no differences between indomethacin and betamimetics. Oligohydramnios, IVH, and perinatal death also showed no differences between groups, but the quality of the reports providing these data varied between moderate and very low.

Indomethacin versus betamimetics (with maintenance therapy)

The evaluation of tocolytic therapy including maintenance therapy showed comparable rates of delivery within 48 hours in the indomethacin and betamimetics (with maintenance therapy) groups (RR 0.55, 95% CI 0.1–2.92). Side effects were seen less often in the indomethacin group: the RR (95% CI) for cessation of therapy because of adverse drug reactions was 0.06 (0.01–0.5). The rate of perinatal death did not differ between the groups. The quality of studies varied between low and moderate.

Indomethacin versus betamimetics (without maintenance therapy)

For the comparison of indomethacin versus betamimetics (without maintenance therapy), the RR (95% CI) of delivery within 48 hours was 0.14 (0.02–1.09), with a moderate to high quality rating. The rates of perinatal death and neonatal mechanical ventilation were comparable between groups. Even without the use of maintenance therapy, fewer adverse drug reactions were seen in the indomethacin group (RR 0.24, 95% CI 0.12–0.5), but the effect on treatment cessation was not evaluated in this study.

Rating of evidence with GRADE versus ratings in guidelines

For the important outcomes, there was evidence from 39 randomised comparisons. According to GRADE, eight (20.5%) were ranked as ‘high’ quality, eight (20.5%) as ‘moderate’, eight (20.5%) as ‘low’, and 15 (38.5%) as ‘very low’. For comparison of these ratings with ratings in guidelines, four areas of recommendation in the guidelines were considered (Figure 3). In the guidelines, evidence for the 13 comparisons in these areas was rated as level I or A, as the study design was an RCT; however, comparison with ratings from GRADE showed that there was concordance in three comparisons only. All other judgements were discordant, with downgrading of evidence in some instances even to low or very low quality. Figure 4 shows a graphical representation of the GRADE profiles used for comparison with guidelines. With respect to the statement that tocolytics should be considered if the few days gained would be used for corticosteroid treatment and transfer to a perinatal centre, we found the same evidence ratings for atosiban and indomethacin; however, for betamimetics and NODs, our ratings ranked the evidence as being of ‘low’ and ‘moderate’ quality, respectively. For the statement that tocolytic drugs are not associated with a reduction in perinatal or neonatal mortality or morbidity, we found that for NODs our rating was ‘high’, matching the level A rating in the guidelines. The evidence ratings for other agents were lower. For the other ratings, guidelines always gave a higher rating level than that obtained using GRADE.
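The proportions quoted above can be checked with a short calculation. The sketch assumes that the ‘nearly 80%’ figure given in the abstract corresponds to the ten of 13 guideline comparisons that were not concordant with GRADE; the text does not spell out this derivation, so it should be read as an interpretation.

```python
# Counts reported in the text for the 39 randomised comparisons
grade_counts = {"high": 8, "moderate": 8, "low": 8, "very low": 15}
total = sum(grade_counts.values())
shares = {level: round(100 * n / total, 1) for level, n in grade_counts.items()}

# Guideline comparisons rated level I or A, and those where GRADE agreed
guideline_high = 13
concordant = 3
downgraded_pct = round(100 * (guideline_high - concordant) / guideline_high, 1)

print(total)           # 39
print(shares)          # {'high': 20.5, 'moderate': 20.5, 'low': 20.5, 'very low': 38.5}
print(downgraded_pct)  # 76.9
```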

Figure 3.

Similarities and differences between the quality assessment of evidence in guidelines and the rating derived using Grading of Recommendations Assessment, Development and Evaluation (GRADE) (see Figure 4). COX, cyclooxygenase; NO, nitric oxide.

Figure 4.

Graphical display of evidence quality and ratings for tocolytic efficacy derived using Grading of Recommendations Assessment, Development and Evaluation (GRADE). Each graph represents the quality domains shown on concentric lines. For each of the spokes, the length represents the quality, which ranges from very low at the centre of the plot to high at the end of the spoke. Concentric lines moving out from the centre show quality increasing to low and then to moderate before reaching the maximum value of high. The quality of the evidence is tabulated in Appendices S1 and S2. [1] Outcomes are ranked in importance according to the percentage of respondents who considered them critical or important in a survey (see the main text and Figure 1 for details). [2] Based on neonatal admission to an intensive care unit. [3] The trade name of atosiban is Tractocile®, Ferring Pharmaceuticals A/S (Saint-Prex, Switzerland), which is now not protected by patent. Ca, calcium channel; NO, nitric oxide.

Conclusion

Evidence summaries of the effects of tocolytic agents attached different levels of importance to the various outcomes and results reported. They assigned quality levels based on a single dimension: the study design. In the absence of consensus-based evidence rating, there is likely to be continuing controversy about tocolysis. This review addressed the effects of different tocolytics on several outcomes of preterm labour, ranging from critical to unimportant, and the quality of evidence according to GRADE. Our graphs provide transparency, capturing large data volumes across many comparisons and quality dimensions.

Our work is not without limitations. We needed to focus our work on the outcomes considered to be most important, and to assess outcomes in terms of importance we chose to survey clinicians as our target group. There were some concerns about the representativeness of the findings of this survey, as the denominator was unknown and no reminders were sent out to non-responders. Given the fast-changing number of members of OBGYN.net, it is unfortunately impossible to estimate the response rate. The probably low response rate is a concern that should be balanced against the large number of responses, as consolidation of data from over 500 participants provides reliability. The validity of our questionnaire may also be considered a limitation. The term ‘validity’ has no agreed definition and comes in many forms: face, construct, criterion, and content validity, to name a few. Validity refers to how well a tool measures what it is supposed to be measuring. Validation is not just an index. It is a process to determine whether or not a tool can serve its purpose.[10-12] With this survey we wanted to capture the importance of outcomes. In our opinion, our tool has at least face, construct, and content validity on account of having been piloted before dissemination. Future research should focus on further validation of tools in this area. The fact that we did not conduct the survey among mothers, a key stakeholder group, can be considered another limitation in that their responses might have been different from those of clinicians. Although threatened preterm labour affects babies, and mothers have rights over their unborn children, their opinions could be seen as not necessarily the most pertinent in relation to determining important clinical outcomes of tocolytic therapies. Moreover, in joint decision-making, the mother is carefully counselled by the clinician in order to make an informed choice about her treatment.

Another possible limitation relates to the subjectivity in rating the quality of evidence with GRADE; however, this is a generic criticism of any evidence-grading system. The GRADE system takes more quality domains into account than the rating systems used in guidelines to date, which rely on just the study design domain. The risk of subjectivity applies to both assessments, although it is likely that the risk of misclassifying evidence quality based on study design is lower. Having said that, this is not as black and white an issue as one might initially think. For example, are studies with group allocations based on birth dates or alternation really randomised? Do studies with sealed envelopes really conceal allocation? We suspect that the biases inherent in these kinds of studies may make them more like observational studies than randomised studies. The risk of subjectivity in the GRADE system automatically leads to a certain level of disagreement between reviewers. We minimised risks associated with subjectivity by performing assessments with two reviewers, and by using arbitration from a third reviewer in cases of discrepancies. Consensus was achieved using aggregated summaries. The utility of advanced consensus methods, for example the Delphi survey, should be the subject of future research. We also provide all our assessments, transparently, for others to scrutinise. Unfortunately, we do not have data on the level of disagreement between the initial two reviewers, and therefore on the reproducibility of the grading of evidence with GRADE.

We decided to include only trials comparing tocolytics with placebo and tocolytics with betamimetics. It can be considered a limitation that we did not include cost-effectiveness studies and studies that compared other tocolytics; however, we aimed to focus on tocolytics compared with placebo, in order to allow conclusions to be drawn about the effectiveness of the tocolytics used, and on tocolytics compared with betamimetics, as betamimetics are the agents that have been most extensively used to date. Further economic evaluation will be required.

The GRADE approach is a system that is increasingly being adopted by organisations worldwide. Our review shows that the rating of the quality of evidence with GRADE can be helpful in identifying weaknesses in evidence not captured by the systems currently used in guidelines. In particular, we found that, when statements in the different national guidelines on tocolytics were graded as level A, the rating using the GRADE approach frequently downgraded the quality assessment. Studies on betamimetics are older, and were therefore more often downgraded than the trials on atosiban or NODs.

Every healthcare professional should consider, in addition to RCT evaluations and GRADE ratings, the individual needs of his or her patient, gestation at presentation, the side-effect profile, cost implications, and potential benefits of tocolytic drugs. GRADE should be used as an aid in the assessment of relevant RCTs designed to evaluate tocolytic use in preterm labour. For example, evidence from studies on indomethacin has been ranked as being of high quality, but indomethacin is rarely used in a clinical context because of serious side effects.

Our formal assessments of the importance of the outcomes, and of the rating of quality of evidence of these outcomes, incorporating a range of quality dimensions, supported some conclusions in tocolysis guidelines, and contradicted others. We propose a move away from the use of evidence-rating systems reliant solely on study design, as they have a propensity towards strong recommendations when the underlying evidence is weak.

Disclosure of interests

Uniform disclosure of potential conflicts of interest: all authors have completed the ICMJE unified competing interest form at www.icmje.org/coi_disclosure.pdf (available from the corresponding author), and declare that: (1) K.S.K. and the Arcana Institute received grants for part of this work from Ferring Pharmaceuticals; (2) K.S.K. and B.W.M. had travel expenses reimbursed by, and received honoraria for delivering educational presentations from, various official Obstetrics and Gynaecology bodies, and received consultancy fees from Ferring Pharmaceuticals. All authors also declare that: (3) no spouses, partners or children have relationships with commercial entities that might have an interest in this work; (4) they have no non-financial interests that may be relevant to this work.

Contribution to authorship

K.S.K. conceived the idea for the graphical display and refined it with input from J.W. initially, and subsequently from all co-authors. GRADE assessment for this review on tocolytic effectiveness using GRADEpro 3.2.2 was conducted by E.B., M.K., A.Z., and C.R., who also prepared the graphical display. C.R. also wrote the initial draft of the article and all subsequent drafts after critical review by all co-authors, and with input from the EBM-CONNECT Collaboration. All co-authors had significant input in the preparation of the article and the analysis. C.R. is the guarantor for the article.

Details of ethics approval

No ethical approval was needed for this review.

Funding

We received funding from the European Union made available to the EBM-CONNECT Collaboration through its Seventh Framework Programme, Marie Curie Actions, International Staff Exchange Scheme (proposal no. 101377; grant agreement no. 247613); EBM-CONNECT Canadian Collaborators received funding from the Canadian Institutes of Health Research, and part funding from Ferring Pharmaceuticals, to undertake the review. None of the funding providers played a role in the planning and execution of this work, or in the drafting of the article.

Acknowledgements

We thank the EBM-CONNECT (Evidence-based medicine collaboration: network for systematic reviews and guideline development research and dissemination) Collaboration, in alphabetical order by country: L. Mignini, Centro Rosarino de Estudios Perinatales, Argentina; P. von Dadelszen, L. Magee and D. Sawchuck, University of British Columbia, Canada; E. Gao, Shanghai Institute of Planned Parenthood Research, China; B.W. Mol and K. Oude Rengerink, Academic Medical Centre, the Netherlands; J. Zamora, Ramon y Cajal, Spain; C. Fox and J. Daniels, University of Birmingham, UK; K.S. Khan, S. Thangaratinam, and C. Meads, Barts and the London School of Medicine, Queen Mary University of London, UK.

Editor's Commentary on What do we know about tocolytic effectiveness and how do we use this information in guidelines? A comparison of evidence grading

Roos et al. report a study of the use of the Grading of Recommendations Assessment, Development and Evaluation (GRADE) system for grading clinical evidence for the development of clinical guidelines based on randomised controlled trials of tocolytics versus placebo or β-mimetics. The GRADE system (GRADE Working Group. BMJ 2004;328:1490–4) has been proposed as an alternative method of grading clinical evidence to the previously used system developed by the US Agency for Health Care Policy and Research (AHCPR, now the US Agency for Healthcare Research and Quality, AHRQ) (AHCPR Publication No. 92-0032. Rockville, MD: AHCPR; 1992). The GRADE system has been endorsed for use in guideline development by agencies such as the National Institute for Health and Clinical Excellence (The Guideline Manual. London: NICE; 2009) and the Scottish Intercollegiate Guidelines Network (SIGN 50. A Guideline Developer's Handbook. Edinburgh: SIGN; 2011), and by organisations such as the World Health Organization (Handbook for Guideline Development. Geneva: WHO; 2010). It ranks evidence into one of four categories: high, moderate, low or very low quality.

The main advantage of the GRADE approach is that, when evidence statements and recommendations in a guideline are graded, it can highlight four kinds of weakness: limitations in study design that may introduce bias into the study results, inconsistency of results across different studies, indirectness of the evidence, and imprecision in the estimates of effect size because of an insufficient number of participants recruited into the selected studies (GRADE Working Group. BMJ 2004;328:1490–4). Under the AHCPR grading system (AHCPR Publication No. 92-0032. Rockville, MD: AHCPR; 1992), randomised studies are invariably assigned to grade 1 whereas observational evidence is usually allocated to grade 2. With this grading system, therefore, evidence is ranked merely according to the study design, without consideration of any possible deficiency in the conduct of the study or in the results. In contrast, with the GRADE system, it is possible to downgrade a randomised study from a high level to a lower grading level if there is concern about the study design and results as described above (GRADE Working Group. BMJ 2004;328:1490–4). Conversely, evidence from well-conducted observational studies can be upgraded to a higher level than that generated by poorly conducted randomised studies.
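The contrast between the two grading approaches can be illustrated with a short sketch. It is a simplified illustration only, not the GRADEpro procedure used by Roos et al.: the names `Evidence`, `grade_quality` and `design_only_grade` are hypothetical, and the rule of subtracting one level per concern is an assumption made here for clarity, whereas actual GRADE ratings rest on reviewer judgement.

```python
from dataclasses import dataclass

# The four GRADE quality categories, ordered from lowest to highest.
LEVELS = ["very low", "low", "moderate", "high"]


@dataclass
class Evidence:
    """Hypothetical summary of a body of evidence for one outcome."""
    randomised: bool
    # Levels to subtract for each domain named in the commentary
    # (assumed scoring: 0 = no concern, 1 = serious, 2 = very serious).
    study_limitations: int = 0
    inconsistency: int = 0
    indirectness: int = 0
    imprecision: int = 0
    # Levels to add, e.g. for well-conducted observational studies (assumption).
    upgrades: int = 0


def grade_quality(e: Evidence) -> str:
    """GRADE-style rating: start from the design, then adjust for concerns."""
    start = 3 if e.randomised else 1  # randomised starts 'high', observational 'low'
    downgrades = (e.study_limitations + e.inconsistency
                  + e.indirectness + e.imprecision)
    score = start - downgrades + e.upgrades
    return LEVELS[max(0, min(3, score))]  # clamp to the four categories


def design_only_grade(e: Evidence) -> int:
    """Design-only rating as described for the AHCPR system: 1 or 2."""
    return 1 if e.randomised else 2


# An older randomised trial with serious design limitations and imprecision.
old_trial = Evidence(randomised=True, study_limitations=1, imprecision=1)
print(design_only_grade(old_trial), grade_quality(old_trial))
```

Under these assumptions the flawed trial keeps the top design-based grade but falls two categories under the GRADE-style rating, which is the kind of discordance the commentary describes.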

On the other hand, the use of the GRADE system does introduce an element of subjective judgement in deciding whether the above limitations in the evidence are present. To address this problem, Roos et al. employed two reviewers to classify the evidence on tocolysis independently, with any disagreement resolved by consensus, or by arbitration by a third reviewer when agreement could not be reached through discussion.

The study by Roos et al. showed that, for four guideline recommendations in four published clinical guidelines based on 13 randomised comparisons, the GRADE and AHCPR grading systems were in complete concordance for only three (23%) of these comparisons (Figure 3 in Roos et al.). For the majority of the remaining randomised comparisons, the recommendation statements were accorded a lower grading level by GRADE than by the AHCPR system.

Another consequence of using the GRADE system to grade clinical evidence is the possibility that older studies tend to be graded to a lower quality because of the lack of awareness of the deficiencies in study design and interpretation of data at the time when these studies were conducted. Hence, for the outcomes of perinatal morbidity/mortality, safety, preventing delivery of the baby within 24 and 48 hours after the initiation of tocolytic therapy (which were ranked to be the most clinically important by participants of the previously piloted survey of clinicians), studies comparing β-mimetics versus placebo and calcium-channel blockers versus β-mimetics tended to be graded lower under this system compared with those comparing atosiban with placebo (Figure 4 in Roos et al.). This observation is possibly because studies with β-mimetics (either as the interventional or control group) tended to be conducted earlier (and hence performed less robustly) compared with those that evaluated atosiban as a tocolytic, which tended to have been undertaken more recently.

Developers and users of clinical guidelines will probably have to familiarise themselves with the GRADE clinical grading system, as guidelines generated in the future will be more likely to employ this method for grading clinical evidence. The greater complexity associated with the use of the GRADE system is not an excuse to avoid using it over the older AHCPR system.

Disclosure of interests

The author is a scientific editor for BJOG, a member of the SIGN Strategy Group and a deputy on the SIGN Council. Otherwise, the author declares that there is no conflict of interest.

  • P Chien, Editor

  • Ninewells Hospital and Medical School, Dundee, UK
