Keywords:

  • Cardiotocography;
  • fetal electrocardiogram;
  • inter- and intra-observer agreement;
  • ST analysis

Abstract

Objective  The objective of this study was to quantify inter- and intra-observer agreement on classification of the intrapartum cardiotocogram (CTG) and decision to intervene following STAN guidelines.

Design  A prospective, observational study.

Setting  Obstetrics Department of a tertiary referral hospital.

Population  STAN recordings of 73 women after 36 weeks of gestation with a high-risk pregnancy, induced or oxytocin-augmented labour, meconium-stained amniotic fluid or epidural analgesia.

Methods  Six observers classified 73 STAN recordings and decided if and when they would suggest an intervention. Proportions of specific agreement (Ps) and kappa values (K) were calculated.

Main outcome measures  Agreement upon classification of the intrapartum CTG and decision to perform an intervention.

Results  Agreement for classification of a normal and a (pre)terminal CTG was good (Ps range 0.50–0.84), but poor for the intermediary and abnormal CTG (Ps range 0.34–0.56). Agreement on the decision to intervene was higher, especially on the decision to perform ‘no intervention’ (Ps range 0.76–0.94). Overall inter-observer agreement on the decision to intervene was considered moderate in five of six observer combinations according to the kappa (K range 0.42–0.73). Intra-observer agreement for CTG classification and decision to intervene was moderate (K range 0.52–0.67 and 0.61–0.75).

Conclusions  Inter-observer agreement on classification of the intrapartum CTG is poor, but addition of information regarding fetal electrocardiogram, especially in case of intermediary or abnormal CTG traces, results in a more standardised decision to intervene.


Introduction

Since the introduction of continuous fetal heart rate (FHR) monitoring in the 1960s, there has been a wide variation in interpretation of FHR patterns and therefore in clinical decision-making.1 Although FHR monitoring, using the intrapartum cardiotocogram (CTG), did not live up to its original expectations, it is still the primary method to monitor fetal wellbeing during delivery and is widely used.2 One of the major disadvantages of the intrapartum CTG is its low specificity, that is, it produces many false-positive test results for poor neonatal outcome. Additional techniques for fetal surveillance have been developed, notably fetal blood sampling (FBS).3 For several reasons, however, FBS is not widely applied.1,4–6 A relatively new method for continuous fetal monitoring is the STAN methodology (Neoventa Medical, Gothenburg, Sweden), in which (classification of) the CTG is combined with ST analysis of the fetal electrocardiogram (ECG). Abnormalities in the ST segment of the fetal ECG are related to metabolic acidosis of the fetus.7,8

The STAN technology seems promising because it reduces the incidence of metabolic acidosis, FBS and operative deliveries.2,9–12 However, because the intrapartum CTG is still part of the STAN technology and variation in interpretation of this CTG is large,13,14 guidelines for its classification are important. Over time, many guidelines have been introduced, of which the FIGO (International Federation of Gynecology and Obstetrics) guidelines for fetal monitoring have probably reached the broadest consensus.15,16 Current guidelines on STAN methodology are based on these FIGO guidelines and help labour ward staff to systematically assess and classify an FHR trace (Appendix). In practice, however, these STAN guidelines also have limitations, which mostly concern (variability in) interpretation of the CTG.17,18

Two studies have shown better inter-observer agreement on the decision to intervene for fetal monitoring using CTG plus ST analysis of the fetal ECG than for monitoring by CTG alone.19,20 A study in the USA on STAN usage showed high percentages of agreement on intervention decisions.21 Interestingly, no study has yet examined reproducibility or intra- and inter-observer agreement regarding assessment or classification of the intrapartum CTG, together with ST information of the fetal ECG, according to the STAN clinical guidelines.

The aim of our study was to systematically quantify the inter- and intra-observer agreement upon classification of the intrapartum CTG and of the decision to perform an intervention following the clinical guidelines of the STAN methodology (Appendix). Furthermore, we studied the association between the level of experience with use of STAN and the inter- and intra-observer agreement.

Materials and methods

Patients

Seventy-five intrapartum STAN recordings were selected from a large database of 637 women who were monitored using a STAN S 21 fetal heart monitor (Neoventa Medical).22 The selection consisted of 11 STAN recordings of deliveries complicated by metabolic acidosis and 64 randomly chosen STAN recordings. Metabolic acidosis cases were defined as a cord artery pH < 7.05 and base deficit in the extracellular fluid compartment (BDecf) > 12.0 mmol/l using the Siggaard-Andersen acid–base chart algorithm.23

The equipment and use of the STAN method have been described elsewhere.7–9 Women were eligible after 36 weeks of gestation and had high-risk pregnancies, induced or oxytocin-augmented labour, meconium-stained amniotic fluid or epidural analgesia. Deliveries were managed by registrars or midwives under supervision of a gynaecologist. Two recordings were excluded because of technical problems and poor signal quality, leaving 73 STAN recordings for analysis.

Observers

Six observers, divided into three categories according to their level of experience with intrapartum ST analysis, were asked to participate in the study. Observer category A contained two ‘expert’ gynaecologists with at least 15 years of clinical experience in obstetrics and about 7 years of experience with intrapartum ST analysis (A1 and A2), who were considered the ‘reference observers’ in the inter-observer analyses. Observer category B contained two senior registrars with at least 3 years of clinical experience in obstetrics and at least half a year of experience with intrapartum ST analysis (B1 and B2). Observer category C contained two junior registrars with at least 1 year of clinical experience in obstetrics and less than half a year of experience with intrapartum ST analysis (C1 and C2).

All observers are medical doctors in a tertiary referral centre with 1800 deliveries per year and daily use of the STAN method. They attended a standard user training at time of the introduction of the STAN method or at the moment of their introduction in the hospital. They attended monthly case analysis meetings as standard part of the clinical STAN training. Before the start and during the period of this study, observers were not additionally trained.

Measurements

From each STAN recording, the last 2 hours were selected. These 2-hour periods were subdivided into four parts of 30 minutes (t1–t4), printed on paper, using a paper speed of 2 cm/minute, and consecutively presented to the six observers.

In November 2006 (T0), all six observers were asked to score the following two outcomes for each 30-minute recording:

  1. Classification of CTG: each subsequent 30 minutes of CTG tracings had to be classified as normal, intermediary, abnormal or (pre)terminal (categorical outcome) according to the STAN clinical guidelines as presented in the Appendix. If a 30-minute episode contained more than one CTG category, the observer was asked to classify the part with longest duration and/or to choose the worst category.
  2. Decision to intervene: after classification of the CTG, observers were asked to interpret possible ST events and to decide whether (dichotomous outcome) they would perform an intervention and at what time that intervention should take place, according to STAN clinical guidelines (Appendix). In case observers decided to perform an intervention, they were also asked to indicate whether the intervention was based on the interpretation of the CTG combined with ST data or on CTG only. Observers were not asked to further specify their interventions.

If an observer suggested an intervention before the end of the 2-hour recording, the remainder of the tracing was not revealed to that observer. Observers were only provided with information on stage of labour (dilatation or active pushing), without knowledge of other clinical parameters or neonatal outcome, and were blinded to each other’s results.

In January 2007 (T1), this entire procedure was repeated. The same 73 cases were presented to the six observers in random order to allow for quantification of the intra-observer agreement or reproducibility.

Data analysis

Observations at T0 and T1 by the same observer were used to quantify the intra-observer agreement or reproducibility, whereas the inter-observer reproducibility was quantified by comparing the observations across the different observers, both at T0 and at T1.

For the dichotomous outcome ‘decision to intervene’, we estimated the kappa statistic (K) to quantify inter- and intra-observer agreement. Kappa is a measure of reproducibility or agreement that corrects for chance-expected agreement.24 Kappa values <0.40 were considered poor agreement, values between 0.40 and 0.75 moderate agreement and values >0.75 excellent agreement.24 For the outcome ‘classification of CTG’, a categorical variable with four categories, we used proportions of agreement or reproducibility. Inter-observer agreement on CTG classification is often expressed as proportions of agreement calculated according to Grant.25 We explicitly chose to calculate the so-called proportion of specific agreement (Ps) according to Fleiss24 because the latter is a conditional probability: here, the probability that an observer assigns a recording to a certain category given that another randomly selected observer has assigned it to the same category. The Ps was also calculated for the dichotomous outcome ‘decision to intervene’.
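As a minimal sketch of these two agreement measures for a dichotomous outcome, the following Python function computes Cohen's kappa and the proportions of specific agreement from a 2 × 2 cross-table of two observers. The function name, the argument names and the cell counts in the usage line are hypothetical and chosen for illustration only; they are not data from this study.

```python
def agreement_2x2(a, b, c, d):
    """Kappa and proportions of specific agreement for two observers.

    a: both observers decide 'intervention'
    b: observer 1 'intervention', observer 2 'no intervention'
    c: observer 1 'no intervention', observer 2 'intervention'
    d: both observers decide 'no intervention'
    """
    n = a + b + c + d
    # Observed proportion of agreement.
    p_observed = (a + d) / n
    # Agreement expected by chance, from the marginal totals of both observers.
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    kappa = (p_observed - p_expected) / (1 - p_expected)
    # Proportions of specific agreement for each of the two categories.
    ps_intervention = 2 * a / (2 * a + b + c)
    ps_no_intervention = 2 * d / (2 * d + b + c)
    return kappa, ps_intervention, ps_no_intervention


# Hypothetical counts (NOT study data): 100 recordings, 10 disagreements.
kappa, ps_intervention, ps_no_intervention = agreement_2x2(20, 5, 5, 70)
print(round(kappa, 2), round(ps_intervention, 2), round(ps_no_intervention, 2))
```

With these hypothetical counts the observed agreement is 0.90 and the chance-expected agreement is 0.625, which illustrates why kappa is lower than the raw proportion of agreement.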

Results

Inter-observer agreement

CTG classification outcome

All observers successfully assessed the 73 recordings at T0 and T1. Table 1 shows the agreement on the four 30-minute CTG traces between observers A1 and A2 (‘experts’, considered as reference) at T0. The number of cases that could be classified decreased from 73 in t1 to 51 in t4 because a decision to perform an intervention before t4 excluded the case from analysis of the subsequent 30-minute episodes (Table 1). Similar results were found for the observers in categories B (B1 and B2) and C (C1 and C2), as shown in Figure 1. Figure 1 also shows the large variation in CTG classification between the six observers, which increased over the CTG traces from t1 to t4. The first episode (t1) predominantly showed a normal CTG class among all observers, whereas the last episode (t4) showed more intermediary and abnormal classes. Although this pattern applied to all six observers, the extent of the variation differed considerably between observers. The recordings at T1 showed similar results, with the same pattern over the four episodes (from t1 to t4) and only slightly higher percentages of agreement (data not shown).

Table 1.  Agreement on CTG classification for the 4-time intervals of the registration (t1–t4) for observers of category A (A1 versus A2), at T0

CTG class              t1 (n = 73)   t2 (n = 67)   t3 (n = 59)   t4 (n = 51)
Normal CTG             40            29            18            11
Intermediary CTG       10            9             14            7
Abnormal CTG           2             6             4             10
(Pre)terminal CTG      NA            NA            NA            NA
Total agreement        52            44            36            28
Total agreement (%)    71.2          65.7          61.0          54.9

NA, not applicable, no cases in which both observers decided to assign this class. For details on the type of observer and time intervals see Observers and Measurements sections in text.
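
The percentages in the bottom row of Table 1 are the summed per-class agreement counts divided by the number of cases still under assessment in each episode. The short Python sketch below reproduces that arithmetic from the counts in Table 1; the variable and function names are illustrative only.

```python
# Agreement counts for observers A1 and A2 at T0, copied from Table 1.
# The (pre)terminal class is omitted because it is NA in every episode.
episodes = {
    "t1": {"n": 73, "agreed": {"normal": 40, "intermediary": 10, "abnormal": 2}},
    "t2": {"n": 67, "agreed": {"normal": 29, "intermediary": 9, "abnormal": 6}},
    "t3": {"n": 59, "agreed": {"normal": 18, "intermediary": 14, "abnormal": 4}},
    "t4": {"n": 51, "agreed": {"normal": 11, "intermediary": 7, "abnormal": 10}},
}


def percent_agreement(episode):
    """Total agreement as a percentage of the cases assessed in the episode."""
    total_agreed = sum(episode["agreed"].values())
    return round(100 * total_agreed / episode["n"], 1)


percentages = {name: percent_agreement(ep) for name, ep in episodes.items()}
print(percentages)
```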

Figure 1. Course of CTG classification throughout time by six observers at T0 (for details see Observers section in text).

The left part of Table 2 shows that across all types of observers at T0, there was poor agreement for each CTG classification with the exception of the normal and (pre)terminal trace. For a normal CTG class, observers of category C agreed best (Ps 0.84). Between categories for all CTG classes, observers A1 and B1 agreed best (Table 2, right part, Ps 0.51–0.80). Agreement was again highest for the normal CTG class in which observer A1 had equal agreement with B1 and C1 (Ps 0.77–0.78), whereas observers B1 and C1 agreed less (Ps 0.70). Again, similar results (Ps values) were found when analysing the T1 observations.

Table 2.  Proportions of specific agreement (Ps) on CTG classification for several inter-observer combinations at T0 (n = 73)

                     Within observer categories    Between observer categories
Ps                   A1-A2    B1-B2    C1-C2       A1-B1    A1-C1    B1-C1
Normal CTG           0.79     0.71     0.84        0.78     0.77     0.70
Intermediary CTG     0.49     0.45     0.49        0.56     0.41     0.41
Abnormal CTG         0.52     0.38     0.42        0.51     0.37     0.34
(Pre)terminal CTG    NA       0.67     NA          0.80     0.67     0.50

NA, not applicable, no cases in which both observers decided to assign this class. For details on the type of observers see Observers section in text.

Decision to intervene outcome

At T0 in 43 of the 73 cases (59%), at least one observer decided to perform an intervention. In 25.6% (11/43) of cases, observers decided to intervene on the CTG alone: in 9 of these 11 cases, this decision was made by one or two observers and in the remaining 2 cases by five or six observers. In the other 74.4% (32/43) of the cases, at least one observer decided to intervene based on ST information.

According to the proportions of agreement, the inter-observer agreement to perform ‘no intervention’ was highest (Table 3), although the Ps for ‘intervention’ was also high, except for B1-B2 (only 0.50). At T1, similar results were found, except that the Ps for intervention for B1-B2 was 0.75 instead of 0.50. Overall agreement on the decision to intervene was considered moderate in five of six observer combinations according to the kappa (K range 0.42–0.73) (Table 3). Within categories (left side of Table 3), observers of category C (C1-C2) had excellent agreement (K = 0.81), which was also shown by highest proportions of specific agreement (Ps 0.94 for no intervention and 0.86 for intervention). At T1, agreement for observer pair C1-C2 was somewhat lower (K = 0.68), but they still showed highest proportions of specific agreement (Ps 0.92 for no intervention and 0.76 for intervention). Between different observer categories (right side of Table 3), observers of category A and B appeared to agree best (K = 0.73), which was also indicated by the high proportions of specific agreement (Ps 0.86–0.87). At T1, again the agreement between different categories of observers was similar except that the agreement for observer pair A1-C1 had a kappa of 0.36 instead of 0.49 at T0.

Table 3.  Kappa values (K) and proportions of specific agreement (Ps) on decision to intervene for several inter-observer combinations at T0 (n = 73)

                      Within observer categories    Between observer categories
K or Ps               A1-A2    B1-B2    C1-C2       A1-B1    A1-C1    B1-C1
K                     0.67     0.42     0.81        0.73     0.49     0.50
Ps no intervention    0.86     0.76     0.94        0.87     0.80     0.80
Ps intervention       0.80     0.50     0.86        0.86     0.68     0.69

For details on the type of observers see Observers sections in text.
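
The qualitative labels attached to the kappa values in Table 3 follow the cut-offs given in the Data analysis section. The sketch below applies those cut-offs to the six kappa values of Table 3. The function name is hypothetical, and treating values of exactly 0.40 or 0.75 as ‘moderate’ is an assumption, since the text only says ‘between 0.40 and 0.75’.

```python
def interpret_kappa(k):
    """Label a kappa value using the cut-offs from the Data analysis section.

    Assumption: the boundary values 0.40 and 0.75 are counted as 'moderate'.
    """
    if k < 0.40:
        return "poor"
    if k <= 0.75:
        return "moderate"
    return "excellent"


# Kappa values for the decision to intervene at T0, copied from Table 3.
table3_kappa = {
    "A1-A2": 0.67, "B1-B2": 0.42, "C1-C2": 0.81,
    "A1-B1": 0.73, "A1-C1": 0.49, "B1-C1": 0.50,
}

labels = {pair: interpret_kappa(k) for pair, k in table3_kappa.items()}
n_moderate = sum(label == "moderate" for label in labels.values())
print(labels)
print(n_moderate)
```

Applied to Table 3, this labels five of the six observer combinations as moderate and only C1-C2 as excellent, in line with the text.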

Regarding the timing of intervention, there was complete agreement between the six observers in 15 of 43 cases (34.9%). In 11 cases (25.6%), there was agreement within a time frame of 30–60 minutes. In ten cases (23.3%), there was agreement within a time frame of 60–90 minutes (n = 8) or 90–120 minutes (n = 2). In the remaining seven cases (16.2%), only one observer decided to intervene. In 10 of the 11 metabolic acidosis cases, an intervention was suggested by at least one observer (Table 4). In six of these, the timing of the intervention did not differ by more than 30 minutes between the six observers, and in three cases, this range was 60 minutes. In one case (number 19), only one observer (B1) decided to perform an intervention, and in the remaining one (number 57), no intervention was decided upon at all. In only two cases (numbers 10 and 68) did all observers decide to intervene for the same reason.

Table 4.  Timing of decision to intervene in cases of metabolic acidosis at T0 (pH < 7.05 and BD > 12; n = 11)

        Timing of decision to intervene according to observers A1-C2
Case    A1     A2     B1     B2     C1     C2
10      S-4    S-4    S-4    S-4    S-4    S-4
19      N      N      S-4    N      N      N
20      C-3    S-4    S-4    N      C-3    N
25      S-1    S-3    S-3    N      N      N
34      S-4    S-4    S-4    S-4    N      S-4
41      C-2    S-4    S-4    N      S-4    S-4
51      N      N      C-3    N      S-4    C-4
57      N      N      N      N      N      N
68      C-1    C-2    C-1    C-1    C-2    C-1
69      C-2    S-4    C-3    C-2    C-3    C-3
70      S-3    C-4    C-3    S-3    S-4    S-3

C, intervention according to CTG; N, no intervention; S, intervention according to CTG + ST; 1,2,3,4 = first, second, third and fourth part of registration.
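
The case counts quoted in the text for the metabolic acidosis recordings can be checked directly against Table 4. The sketch below transcribes the table and derives those counts; the variable names are illustrative, and the codes follow the table legend (C, CTG only; S, CTG + ST; N, no intervention; the digit is the 30-minute part).

```python
# Decisions per case for observers A1, A2, B1, B2, C1, C2, copied from Table 4.
table4 = {
    10: ["S-4", "S-4", "S-4", "S-4", "S-4", "S-4"],
    19: ["N", "N", "S-4", "N", "N", "N"],
    20: ["C-3", "S-4", "S-4", "N", "C-3", "N"],
    25: ["S-1", "S-3", "S-3", "N", "N", "N"],
    34: ["S-4", "S-4", "S-4", "S-4", "N", "S-4"],
    41: ["C-2", "S-4", "S-4", "N", "S-4", "S-4"],
    51: ["N", "N", "C-3", "N", "S-4", "C-4"],
    57: ["N", "N", "N", "N", "N", "N"],
    68: ["C-1", "C-2", "C-1", "C-1", "C-2", "C-1"],
    69: ["C-2", "S-4", "C-3", "C-2", "C-3", "C-3"],
    70: ["S-3", "C-4", "C-3", "S-3", "S-4", "S-3"],
}

# Keep only the codes of observers who decided to intervene.
interventions = {case: [c for c in codes if c != "N"] for case, codes in table4.items()}

at_least_one = [case for case, codes in interventions.items() if codes]
nobody = [case for case, codes in interventions.items() if not codes]
single_observer = [case for case, codes in interventions.items() if len(codes) == 1]

# All six observers intervene and all give the same reason (all C or all S).
unanimous_same_reason = [
    case for case, codes in table4.items()
    if "N" not in codes and len({code.split("-")[0] for code in codes}) == 1
]

print(len(at_least_one), nobody, single_observer, unanimous_same_reason)
```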

Intra-observer agreement

Table 5 shows the results of the intra-observer agreement. The kappa statistics indicate higher agreement than the inter-observer results. Observer B1 had the highest agreement between T0 and T1. Overall, the intra-observer agreement for classification of the CTG and decision to intervene was moderate.

Table 5.  Intra-observer agreement on CTG classification and decision to intervene for T0 and T1, expressed by kappa values

                         A1      A2      B1      B2      C1      C2
CTG classification       0.64    0.59    0.67    0.52    0.62    0.54
Decision to intervene    0.72    0.68    0.75    0.61    0.62    0.64

Discussion

We have shown that there is large variation in classification of the intrapartum CTG, even when FIGO guidelines are used. Inter-observer agreement for the normal CTG was good but decreased when the CTG became intermediary or abnormal. (Pre)terminal traces again showed a moderate to good inter-observer agreement. The slightly higher percentages of inter-observer agreement at T1 (compared with T0) may be explained by a small learning effect of reading and classifying CTG traces.

For the normal CTG, observers with little (category C) and much (category A) STAN experience agreed better than observers with half a year of experience (category B). This might be due to the fact that training of ‘beginners’ is still fresh and they therefore follow the guidelines more strictly than other observers. Experts may agree better because of their larger experience. For all CTG classes, agreement between ‘experts’ and observers with at least half a year of experience with ST analysis (A1-B1) appeared to be best.

The observers agreed better on the decision to intervene than on the CTG classification, especially on the decision to perform no intervention. Observers with less than half a year of experience with ST analysis (‘beginners’) agreed best. However, their agreement with more experienced observers was poor to moderate. This may indicate that although ‘beginners’ agree well with each other, their judgement concerning the decision to intervene differs from, and is perhaps wrong compared with, that of more experienced observers, assuming that the latter more often make the correct decision (‘reference’). In our study, we found excellent agreement for ‘beginner’ observers at T0, whereas at T1 agreement for this group was considered moderate. Perhaps their excellent agreement at T0 arose by chance, partly because of the relatively small numbers.

In the STAN methodology, the decision to intervene or otherwise depends on both CTG interpretation and interpretation of ST events. Our study indicates that the efficacy of this method of fetal surveillance, although proven to be promising,9,12 seems hampered by a poor to moderate agreement for CTG interpretation. It was reassuring that agreement on normal and (pre)terminal CTG traces was relatively good since with such heart rate patterns additional information on ECG waveforms is not required. Although agreement for CTG interpretation was moderate, the observers agreed quite well on the timing of an intervention, which in the end is the most important decision in daily clinical practice. Possibly, the availability of ST information and use of STAN guidelines result in a more standardised assessment of the CTG and the total clinical situation, which may eventually result in better agreement on decision to intervene.

This study has some possible limitations. The first is the relatively low number of abnormal and (pre)terminal CTG traces in the selected women. We, however, explicitly decided not to overrepresent such traces to ensure that observers were not exposed to abnormal CTGs only and that they paid full attention to the whole spectrum of CTG tracings. For fetal surveillance, agreement on both normal and abnormal CTG assessments is important: disagreement on abnormal CTGs may result in infants being damaged by hypoxia and disagreement on normal CTGs may cause unnecessary interventions. It is therefore necessary to consider agreement for abnormal and normal CTG assessments separately.

Second, since this study concerns classification of CTG traces, it is possible that some observer bias has played a role because a 30-minute CTG trace may show both intermediary and abnormal parts. Although observers were asked in advance to classify such traces in the worst category, this may still have increased inter-observer variability.

Third, for simplicity, we chose to present data on ‘between category’ agreement for CTG classification and decision to intervene only for the first observer combinations A1-B1, A1-C1 and B1-C1. Agreement for observer pairs A2-B2, A2-C2 and B2-C2 was similar.

Finally, a drawback of this study may be the use of paper printouts for CTG assessment, which may not optimally mimic clinical practice.

In conclusion, we found a large variation in classification of the intrapartum CTG, despite the use of FIGO guidelines and availability of ST information. Agreement for the normal and (pre)terminal CTG trace was good but decreased when CTG traces were intermediary or abnormal. Agreement was better on the decision to intervene, especially on the decision not to intervene and on the timing of the intervention. This suggests that addition of information regarding fetal ECG, especially in case of intermediary or abnormal CTG traces, results in a more standardised decision to intervene or otherwise.

Disclosure of interests

There are no disclosures of interests.

Contribution to authorship

MW, EH, AK, IT, GV and KM designed the study; MW and EH collected the data; MW, EH and IT analysed the data; MW, EH, AK, IT, GV and KM wrote the paper.

Details of ethics approval

Not applicable for this type of observational study.

Funding

There was no funding of this study.

Acknowledgments

This research was supported by a grant from ZonMW, the Dutch Organisation for Health Research and Development (grant number: 945-06-557).

Appendix

Appendix: STAN® clinical guidelines

Classification of cardiotocographic patterns according to FIGO guidelines

Cardiotocographic classification: Normal
  Baseline heart frequency: 110–150 beats per minute
  Variability, reactivity: 5–25 beats per minute; accelerations
  Decelerations: Early decelerations; uncomplicated variable decelerations with a duration of <60 seconds and a beat loss of <60 beats per minute

Cardiotocographic classification: Intermediary*
  Baseline heart frequency: 100–110 beats per minute; 150–170 beats per minute; short bradycardia episode
  Variability, reactivity: >25 beats per minute without accelerations; <5 beats per minute for >40 minutes
  Decelerations: Uncomplicated variable decelerations with a duration of <60 seconds and a beat loss of >60 beats per minute

Cardiotocographic classification: Abnormal
  Baseline heart frequency: 150–170 beats per minute and reduced variability; >170 beats per minute
  Variability, reactivity: <5 beats per minute for >60 minutes; sinusoidal pattern
  Decelerations: Repeated late decelerations; complicated variable decelerations with a duration of >60 seconds

Cardiotocographic classification: Preterminal
  Total lack of variability and reactivity with or without decelerations or bradycardia

* Combination of several intermediary observations will result in an abnormal CTG.

ST changes that prompt clinical intervention such as delivery or solving a cause of fetal distress

Episodic T/QRS rise (duration <10 minutes)
  Intermediary CTG: Increase >0.15 from baseline
  Abnormal CTG: Increase >0.10 from baseline

Baseline T/QRS rise (duration ≥10 minutes)
  Intermediary CTG: Increase >0.10 from baseline
  Abnormal CTG: Increase >0.05 from baseline

Biphasic ST (a component of the ST segment below the baseline)
  Intermediary CTG: Continuous >5 minutes or >2 episodes of coupled biphasic ST type 2 or 3
  Abnormal CTG: Continuous >2 minutes or >1 episode of coupled biphasic ST type 2 or 3

The ST log requires 20 minutes recording for automatic ST analysis to start. A decrease in signal quality with insufficient number of T/QRS measurements requires manual data analysis.
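
The ST criteria above amount to a threshold lookup keyed on the CTG class. The Python sketch below encodes them purely as an illustration of how the criteria combine; it is not clinical software. The function name, parameter names and data structure are hypothetical, and the caller is assumed to have already decided whether a T/QRS rise is episodic (<10 minutes) or a baseline rise (≥10 minutes). Because the criteria are listed only for intermediary and abnormal CTGs, other classes are rejected rather than guessed.

```python
# Thresholds transcribed from the ST criteria in this appendix.
# All comparisons are strict ('>'), as written in the guideline table.
ST_THRESHOLDS = {
    "intermediary": {
        "episodic_rise": 0.15,      # T/QRS increase from baseline, duration <10 min
        "baseline_rise": 0.10,      # T/QRS increase from baseline, duration >=10 min
        "biphasic_minutes": 5,      # continuous biphasic ST, minutes
        "biphasic_episodes": 2,     # coupled biphasic ST episodes, type 2 or 3
    },
    "abnormal": {
        "episodic_rise": 0.10,
        "baseline_rise": 0.05,
        "biphasic_minutes": 2,
        "biphasic_episodes": 1,
    },
}


def st_change_prompts_intervention(ctg_class, *, episodic_rise=0.0,
                                   baseline_rise=0.0, biphasic_minutes=0.0,
                                   biphasic_episodes=0):
    """Return True if any ST criterion for the given CTG class is exceeded."""
    if ctg_class not in ST_THRESHOLDS:
        # The table gives no ST criteria for normal or preterminal CTGs.
        raise ValueError(f"no ST criteria listed for CTG class {ctg_class!r}")
    limits = ST_THRESHOLDS[ctg_class]
    return (
        episodic_rise > limits["episodic_rise"]
        or baseline_rise > limits["baseline_rise"]
        or biphasic_minutes > limits["biphasic_minutes"]
        or biphasic_episodes > limits["biphasic_episodes"]
    )


# The same episodic rise of 0.12 is significant only with an abnormal CTG.
print(st_change_prompts_intervention("intermediary", episodic_rise=0.12))
print(st_change_prompts_intervention("abnormal", episodic_rise=0.12))
```

The usage lines show why the CTG classification matters for the decision to intervene: the same ST event can fall below the threshold for an intermediary CTG and above it for an abnormal CTG.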