Inter- and intra-observer agreement of non-reassuring cardiotocography analysis and subsequent clinical management

Authors

  • Sarah Rhöse,

    Corresponding author
    1. Department of Obstetrics and Gynecology, Radboud University Nijmegen Medical Center, Nijmegen, the Netherlands
    • Correspondence

      Sarah Rhöse, Department of Obstetrics and Gynecology, Radboud University Nijmegen Medical Centre, PO BOX 9101, 6500 HB Nijmegen, the Netherlands. E-mail: s.rhose@gmail.com

  • Ayesha M.F. Heinis,

    1. Department of Obstetrics and Gynecology, Radboud University Nijmegen Medical Center, Nijmegen, the Netherlands
  • Frank Vandenbussche,

    1. Department of Obstetrics and Gynecology, Radboud University Nijmegen Medical Center, Nijmegen, the Netherlands
  • Joris van Drongelen,

    1. Department of Obstetrics and Gynecology, Radboud University Nijmegen Medical Center, Nijmegen, the Netherlands
  • Jeroen van Dillen

    1. Department of Obstetrics and Gynecology, Radboud University Nijmegen Medical Center, Nijmegen, the Netherlands

  • The authors have no relevant financial (for example patent ownership, stock ownership, consultancies, speaker's fees, shares), personal, political, intellectual (organizing education) or religious interest requiring disclosure.

Abstract

Objective

To quantify inter- and intra-observer agreement of non-reassuring intrapartum cardiotocography (CTG) patterns and subsequent clinical management.

Design

Methodological study.

Setting

University Medical Center.

Population

CTG patterns of 97 women beyond 37 weeks of gestation with a singleton fetus in vertex position in first stage of labor in whom fetal blood sampling (FBS) had been performed.

Methods

Nine observers assessed CTG patterns, which were formerly clinically classified as non-reassuring and indicative for FBS, according to the guidelines of the International Federation of Gynecology and Obstetrics modified for ST analysis. They also proposed clinical management strategies without and with insight into clinical parameters. Weighted kappa values (κw) and proportions of agreement (Pa) were calculated.

Main outcome measures

Agreement on CTG classification and clinical management.

Results

Inter-observer agreement on CTG classification and on clinical management was poor for most observer categories (κw range 0.31–0.50 and 0.20–0.45, respectively). Observers agreed best on abnormal CTG patterns (Pa range 0.28–0.36) and on the clinical management option “continue monitoring” (Pa range 0.32–0.40). Intra-observer agreement was fair to good for most observers (κw 0.33–0.70). Insight into clinical parameters resulted in similar inter- and intra-observer agreement.

Conclusions

There was poor inter-observer agreement and fair to good intra-observer agreement on classification and clinical management of intrapartum CTG patterns, which had been classified as non-reassuring and indicative for FBS during birth.

Abbreviations
CTG

cardiotocography

FBS

fetal blood sampling

FIGO/STAN guidelines

guidelines of the International Federation of Gynecology and Obstetrics modified for ST analysis

κw

weighted kappa (in case of three categories weights 1, 0.5 and 0, in case of four categories weights 1, 0.66, 0.33 and 0)

Pa

proportions of agreement

CI

confidence interval

RUNMC

Radboud University Nijmegen Medical Center

Key Message

Inter- and intra-observer agreement of CTG classification is often poor. Even in cases formerly clinically classified as non-reassuring and indicative for fetal blood sampling, there is poor inter-observer and fair to good intra-observer agreement on CTG classification and subsequent clinical management.

Introduction

Cardiotocography (CTG) is a technique used to monitor intrapartum fetal condition. It is a continuous simultaneous record of the fetal heart rate and of the presence of uterine activity [1]. Several systems exist to classify a CTG pattern, of which the International Federation of Gynecology and Obstetrics/ST-analysis (FIGO/STAN) guidelines have probably reached the broadest consensus [2]. One of the main disadvantages of CTG is the substantial variation in the interpretation of CTG patterns. Previous research has shown that inter- and intra-observer agreement of CTG classification is often poor [3-6]. The specificity of CTG is also low, i.e. it yields many false-positive test results for poor neonatal outcome [7]. As a consequence, unnecessary obstetric interventions for suspected fetal distress are performed [1, 7-9]. In an attempt to reduce these unnecessary interventions, fetal blood sampling (FBS) can be used as a diagnostic test to gain additional information on fetal condition [10]. However, FBS is an invasive, cumbersome procedure with failure rates of up to 20% [8] and it is only assumed that FBS reduces the cesarean section rate [1]. Therefore, a balance should be struck between the use of FBS and the rate of cesarean sections.

Observations suggest considerable variation in the indications to perform FBS. First, FBS use varies between countries, hospitals and care workers. In the Netherlands, FBS is performed in 3–15% of deliveries [11] but these rates are different in other countries [12-14]. Secondly, there is a lack of evidence-based guidelines on when to perform FBS [15, 16]. In addition, the impression exists that FBS is performed on more liberal indications than described in the existing guidelines, i.e. for reassurance on fetal condition before active pushing is started.

The objectives of this study were to quantify variation (i.e. inter- and intra-observer agreement) in (i) the classification of intrapartum CTG patterns prior to FBS, according to the FIGO/STAN guidelines [17], and (ii) management based on this classification. The impact of insight into clinical parameters and also of the observers' level of experience was analyzed. The relationship between fetal scalp pH and CTG category was also evaluated.

Material and methods

We performed a methodological study at the Radboud University Nijmegen Medical Center (RUNMC), which is a tertiary referral hospital with approximately 1500 deliveries a year. In the Netherlands, independent primary care midwives provide care to low-risk women. If complications or risk factors occur during pregnancy, labor or after birth, women are referred to secondary or tertiary care. After referral, care may be provided by clinical midwives or obstetric registrars but the care is always the final responsibility of an obstetrician. This system is also in place for the RUNMC. At RUNMC, the fetal condition during labor is monitored with continuous CTG and FBS is performed in 11% of deliveries [18]. This study was approved by the Research Ethics Committee at the RUNMC (2012/080).

We selected all women who delivered between February 2010 and March 2011 in whom FBS was performed based on a non-reassuring CTG during the first stage of labor. Women were included if they complied with the following inclusion criteria: singleton fetus in vertex position, gestational age of 37 weeks or over, and optimal CTG registration 60 min prior to FBS.

Nine observers were asked to participate in the study. They were divided into three groups according to their specialty level and level of experience in obstetrics. Group A contained four obstetricians with at least 1 year of experience as obstetrician. Group B contained three registrars in gynecology in their second or third educational year (full educational period is 6 years with obstetrics incorporated in each year). Group C contained two clinical midwives with 2–3 years of clinical experience. All observers use fetal monitoring with CTG regularly in clinical practice. They had received training as a standard part of their education. Knowledge about CTG interpretation is exchanged in the daily morning protocol where CTG patterns are discussed with attending midwives, registrars and obstetricians.

For each case, we extracted CTG patterns from the obstetric database of the RUNMC. We identified the 60-min CTG pattern prior to FBS. If fetal blood was sampled more than once, the CTG pattern prior to the first sample was used. We also extracted clinical parameters, including general information (maternal age, obstetric history and risk factors in the current pregnancy), parameters that influence progression of labor (parity, cervical dilation, augmentation of labor) and parameters that influence fetal condition (gestational age, the presence of meconium-stained amniotic fluid, maternal body temperature and estimated fetal weight).

The nine observers were asked to assess the CTG patterns twice with an interval of at least 1 month: T0 and T1. To assess the CTG patterns, a web-based tool was developed which provided the observers with CTG patterns and corresponding clinical parameters. This web-based tool did not allow observers to change previous answers and presented the cases in random order at T0 and T1. The assessment procedure went as follows. First, the observers were shown the anonymized CTG pattern (paper speed 2 cm/min, which is commonly used in the Netherlands) and were asked to classify it according to the FIGO/STAN guidelines into one of four categories: normal, intermediary, abnormal or preterminal. At this time observers were provided with the information that the CTG was taken at the dilation phase of a singleton fetus in vertex position in whom FBS had been performed during labor. Secondly, observers were asked to choose a clinical management option: (i) continue monitoring with the available measures to determine the causes of fetal distress by conservative management (e.g. treating hypotension, discontinuing oxytocin infusion, changing maternal position), (ii) perform FBS, or (iii) immediate delivery (within 15 min). Thirdly, observers were provided with the corresponding clinical parameters and were then asked again to choose a clinical management option.

Statistical analyses

The CTG classification and the clinical management strategy chosen by different observers were used to calculate inter-observer agreement. Intra-observer agreement was calculated using observations from the same observer at T0 and T1. SPSS for Windows Rel. 18.0.2 (SPSS Inc., Chicago, IL, USA) and SAS/GRAPH® 9.2 Second Edition (SAS Institute Inc., Cary, NC, USA) were used for the data analysis.

Weighted kappa (κw) was calculated for the categorical ordinal variables “CTG classification” and “clinical management.” Proportions of agreement (Pa) were calculated for each individual CTG category and clinical management option [19]. Weighted kappa values above 0.75 were considered excellent agreement, between 0.40 and 0.75 fair to good agreement and below 0.40 poor agreement.
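As an illustration of the statistic described above, the following is a minimal sketch of Cohen's weighted kappa for two raters with linear agreement weights, which reproduce the weights listed under Abbreviations (1, 0.5, 0 for three categories; 1, 0.66, 0.33, 0 for four). It is not the SPSS or SAS procedure used in the study. The function names, the integer coding of categories and the treatment of the boundary values 0.40 and 0.75 as "fair to good" are assumptions made for the example.

```python
import math
from typing import Sequence


def weighted_kappa(r1: Sequence[int], r2: Sequence[int], k: int) -> float:
    """Cohen's weighted kappa for two raters and k ordered categories (coded 0..k-1).

    Uses linear agreement weights w_ij = 1 - |i - j| / (k - 1).
    """
    if len(r1) != len(r2) or len(r1) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(r1)
    weights = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]

    # Observed joint proportions for each pair of categories.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        observed[a][b] += 1 / n

    # Marginal proportions per rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    p_obs = sum(weights[i][j] * observed[i][j] for i in range(k) for j in range(k))
    p_exp = sum(weights[i][j] * row[i] * col[j] for i in range(k) for j in range(k))
    if math.isclose(p_exp, 1.0):
        return float("nan")  # undefined when both raters always use one category
    return (p_obs - p_exp) / (1 - p_exp)


def interpret(kappa: float) -> str:
    """Map kappa to the labels used in the text (boundary handling is an assumption)."""
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.40:
        return "fair to good"
    return "poor"


# Hypothetical ratings: 0 = normal, 1 = intermediary, 2 = abnormal, 3 = preterminal.
rater_a = [1, 2, 2, 1, 0, 2, 1, 3]
rater_b = [1, 2, 1, 1, 1, 2, 2, 2]
print(interpret(weighted_kappa(rater_a, rater_b, k=4)))
```

With linear weights, a disagreement between adjacent categories (for example intermediary versus abnormal) is penalized less than a disagreement between distant categories (normal versus preterminal).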

Both κw and Pa describe the amount of agreement, but in different ways. κw measures agreement beyond agreement expected by chance [20]. It measures the degree of association between two variables, but not their true agreement [19]. Pa shows the proportion of cases on which observers agree [20]. A limitation of Pa is that it does not take into account agreement expected by chance.
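The text does not give the formula used for Pa. As a hedged illustration only, the sketch below computes two common quantities for a pair of raters: the overall proportion of agreement and the proportion of specific agreement for a single category, 2a / (2a + b + c), where a is the number of cases both raters placed in the category and b and c are the cases only one of them did. Whether this matches the authors' exact definition is an assumption, as are the function names and the convention of returning `None` when neither rater ever chose the category.

```python
from typing import Hashable, Optional, Sequence


def overall_agreement(r1: Sequence[Hashable], r2: Sequence[Hashable]) -> float:
    """Proportion of cases on which two raters chose the same category."""
    if len(r1) != len(r2) or len(r1) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def specific_agreement(
    r1: Sequence[Hashable], r2: Sequence[Hashable], category: Hashable
) -> Optional[float]:
    """Proportion of specific agreement for one category: 2a / (2a + b + c)."""
    both = sum(a == category and b == category for a, b in zip(r1, r2))
    # Total number of times either rater used the category (equals 2a + b + c).
    total = sum(a == category for a in r1) + sum(b == category for b in r2)
    if total == 0:
        return None  # category never used, so the proportion is undefined
    return 2 * both / total


# Hypothetical ratings for four CTG patterns.
rater_a = ["abnormal", "abnormal", "intermediary", "normal"]
rater_b = ["abnormal", "intermediary", "intermediary", "abnormal"]

print(overall_agreement(rater_a, rater_b))
for cat in ("normal", "intermediary", "abnormal", "preterminal"):
    print(cat, specific_agreement(rater_a, rater_b, cat))
```

Neither quantity corrects for agreement expected by chance, which is why a per-category proportion is usually read alongside a chance-corrected statistic such as kappa.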

Prior to this study, we performed power analysis. A sample size estimation for analyzing agreement (kappa) between two observers was made with α = 0.05 and β = 0.2. To investigate the null hypothesis H0: p0 = 0.4 and the alternative hypothesis H1: p1 = 0.6, a sample size of 90–100 CTG patterns was estimated [20, 21].

For correlation analysis of CTG classification and scalp pH, we selected those cases in which consensus on CTG classification was reached. We defined consensus as agreement on the CTG category among more than 75% of observers (≥7 of 9). Scalp pH was considered non-reassuring if pH was below or equal to 7.25 and acidotic if pH was below 7.20. The proportion of non-reassuring scalp pH was calculated for each CTG category.
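The consensus rule and pH cut-offs described above can be sketched as follows. This is an illustrative reading of the text, not the authors' analysis code: the function names and data layout are hypothetical, and skipping cases with a missing pH (failed samples) is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Sequence, Tuple


def consensus_category(ratings: Sequence[str], threshold: float = 0.75) -> Optional[str]:
    """Return the category chosen by more than `threshold` of observers, else None."""
    category, count = Counter(ratings).most_common(1)[0]
    return category if count / len(ratings) > threshold else None


def is_non_reassuring(ph: float) -> bool:
    """Scalp pH below or equal to 7.25."""
    return ph <= 7.25


def is_acidotic(ph: float) -> bool:
    """Scalp pH below 7.20."""
    return ph < 7.20


def non_reassuring_rate_by_category(
    cases: Iterable[Tuple[Sequence[str], Optional[float]]]
) -> Dict[str, float]:
    """Proportion of non-reassuring scalp pH per consensus CTG category."""
    totals: Counter = Counter()
    hits: Counter = Counter()
    for ratings, ph in cases:
        category = consensus_category(ratings)
        if category is None or ph is None:  # assumption: skip failed pH samples
            continue
        totals[category] += 1
        hits[category] += is_non_reassuring(ph)
    return {c: hits[c] / totals[c] for c in totals}


# Hypothetical cases: nine observer ratings plus the scalp pH (None if sampling failed).
cases = [
    (["abnormal"] * 7 + ["intermediary"] * 2, 7.22),  # consensus, non-reassuring
    (["abnormal"] * 8 + ["normal"], 7.30),            # consensus, reassuring
    (["intermediary"] * 6 + ["abnormal"] * 3, 7.10),  # 6 of 9: no consensus
    (["normal"] * 9, None),                           # consensus, pH missing
]
print(non_reassuring_rate_by_category(cases))
```

With nine observers, "more than 75%" means at least seven matching ratings, since six of nine is only 67%.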

Results

During the study period, 1755 women delivered at the RUNMC. Ninety-seven women met the inclusion criteria. Table 1 summarizes the characteristics of the mother and baby pairs. In 15 cases, scalp pH was non-reassuring, in one case scalp pH was acidotic and in 14 cases scalp pH analysis had failed.

Table 1. Characteristics of 97 mother and baby pairs

| Characteristic | Median or n | Range or % |
| --- | --- | --- |
| Maternal age (median, range) | 31 years | 19–43 years |
| Gestational age (median, range) | 281 days | 263–295 days |
| Nulliparous women (n, %) | 78 | 80 |
| Estimated fetal weight | | |
| SGA (n, %) (a) | 8 | 8 |
| LGA (n, %) (b) | 5 | 5 |
| Cervical dilatation at time of FBS (median, range) | 7 cm | 3–10 cm |
| Maternal fever (n, %) (c) | 16 | 16 |
| Augmentation of labor (n, %) | 71 | 73 |
| Meconium-stained amniotic fluid (n, %) | 25 | 26 |
| FBS pH ≤7.25 (n, %) | 15 | 15 |
| Birthweight (median, range) | 3440 g | 2405–4618 g |
| SGA (n, %) (d) | 11 | 11 |
| LGA (n, %) (e) | 9 | 9 |

FBS, fetal blood sampling.
(a) SGA, small for gestational age: estimated fetal weight below the 10th percentile of the Verburg growth curve [25].
(b) LGA, large for gestational age: estimated fetal weight above the 90th percentile of the Verburg growth curve [25].
(c) Maternal temperature ≥37.8°C.
(d) Small-for-gestational age, i.e. below the 10th percentile of the National Perinatal Registry curves [26].
(e) Large-for-gestational age, i.e. above the 90th percentile of the National Perinatal Registry curves [26].

All observers assessed 97 CTG patterns at T0 and T1. Mean assessment time of all CTG patterns at T0 was 78 min, ranging from 46 to 121 min. At T1, mean assessment time was 58 min, ranging from 29 to 97 min. Most CTG patterns were classified as intermediary or abnormal (Figure 1). The clinical management strategy “FBS” was chosen in 23% of CTG patterns classified as intermediary and in 72% of CTG patterns classified as abnormal (Figure 1).

Figure 1.

Clinical management strategy per cardiotocography (CTG) category at T0 (total number of CTG patterns was 873 as nine observers each assessed 97 CTG patterns). FBS, fetal blood sampling.

At T0, inter-observer agreement on classification of CTG patterns was poor within and between most observer categories, except for clinical midwives, whose agreement was fair to good (Table 2). Agreement was best for abnormal CTG patterns (Pa all observers 0.31). Agreement on clinical management was poor within and between most observer categories (Table 3). Registrars and clinical midwives agreed moderately. Observers agreed best on the clinical management option “continue monitoring.”

Table 2. Inter-observer agreement [weighted kappa (κw) and proportions of agreement (Pa)] at T0 on cardiotocography classification within and between observer categories

| Observer category | κw (95% CI) | Pa normal (95% CI) | Pa intermediary (95% CI) | Pa abnormal (95% CI) | Pa preterminal (95% CI) |
| --- | --- | --- | --- | --- | --- |
| All observers | 0.34 (0.27–0.41) | 0.23 (0.20–0.25) | 0.23 (0.22–0.25) | 0.31 (0.30–0.33) | 0.09 (0.04–0.14) |
| Obstetricians (A) | 0.31 (0.23–0.39) | 0.20 (0.14–0.26) | 0.24 (0.20–0.27) | 0.28 (0.24–0.32) | 0.11 (−0.09 to 0.32) |
| Registrars gynecology (B) | 0.33 (0.23–0.43) | 0.20 (0.10–0.30) | 0.19 (0.13–0.24) | 0.33 (0.28–0.38) | 0.15 (0.02–0.29) |
| Clinical midwives (C) | 0.50 (0.35–0.65) | 0.30 (0.14–0.46) | 0.30 (0.19–0.41) | 0.36 (0.27–0.46) | NA (a) |
| A–B | 0.30 (0.23–0.37) | 0.20 (0.17–0.24) | 0.22 (0.20–0.24) | 0.30 (0.27–0.32) | 0.10 (0.04–0.17) |
| A–C | 0.37 (0.29–0.45) | 0.24 (0.20–0.28) | 0.25 (0.23–0.27) | 0.31 (0.28–0.33) | 0.08 (−0.03 to 0.19) |
| B–C | 0.38 (0.29–0.47) | 0.24 (0.18–0.29) | 0.23 (0.19–0.26) | 0.34 (0.32–0.37) | 0.10 (0.02–0.18) |

CI, confidence interval.
(a) NA, not applicable, no cases in which two observers decided to choose this category.

Table 3. Inter-observer agreement [weighted kappa (κw) and proportions of agreement (Pa)] at T0 on clinical management within and between observer categories

| Observer category | κw, without (a) (95% CI) | κw, with (a) (95% CI) | Pa (b) continue monitoring (95% CI) | Pa (b) FBS (95% CI) | Pa (b) immediate delivery (95% CI) |
| --- | --- | --- | --- | --- | --- |
| All observers | 0.29 (0.22–0.36) | 0.24 (0.18–0.30) | 0.34 (0.33–0.36) | 0.28 (0.26–0.29) | 0.10 (0.06–0.14) |
| Obstetricians (A) | 0.20 (0.12–0.28) | 0.13 (0.06–0.20) | 0.28 (0.25–0.32) | 0.27 (0.24–0.31) | 0.11 (−0.09 to 0.32) |
| Registrars gynecology (B) | 0.40 (0.29–0.51) | 0.42 (0.32–0.52) | 0.38 (0.33–0.43) | 0.29 (0.23–0.35) | 0.12 (0.01–0.23) |
| Clinical midwives (C) | 0.45 (0.29–0.61) | 0.41 (0.25–0.57) | 0.40 (0.31–0.48) | 0.29 (0.18–0.40) | 0.25 (−0.17 to 0.67) |
| A–B | 0.25 (0.18–0.32) | 0.21 (0.15–0.27) | 0.32 (0.30–0.34) | 0.27 (0.25–0.29) | 0.08 (0.03–0.13) |
| A–C | 0.27 (0.20–0.34) | 0.19 (0.13–0.25) | 0.33 (0.30–0.35) | 0.28 (0.25–0.30) | 0.14 (0.03–0.26) |
| B–C | 0.38 (0.29–0.47) | 0.37 (0.28–0.46) | 0.38 (0.35–0.41) | 0.29 (0.25–0.32) | 0.12 (0.05–0.19) |

CI, confidence interval; FBS, fetal blood sampling.
(a) Insight into clinical parameters.
(b) Without insight into clinical parameters.

At T1, agreement on CTG classification and clinical management strategy was slightly lower than agreement at T0 for the observer categories “clinical midwives” and “registrars.” Agreement within the observer category “obstetricians” and agreement between observer categories were similar to those at T0.

Intra-observer agreement on CTG classification and clinical management strategy was fair to good for most observers (Table 4). Intra-observer agreement was best on abnormal CTG patterns and on the clinical management option “continue monitoring.”

Table 4. Intra-observer agreement [weighted kappa (κw) and proportions of agreement (Pa)] of all observers on cardiotocography classification

| Observer | κw (95% CI) | Pa normal (95% CI) | Pa intermediary (95% CI) | Pa abnormal (95% CI) | Pa preterminal (95% CI) |
| --- | --- | --- | --- | --- | --- |
| A1 | 0.36 (0.20–0.52) | 0.11 (−0.09 to 0.32) | 0.58 (0.47–0.69) | 0.42 (0.27–0.57) | NA (a) |
| A2 | 0.66 (0.55–0.77) | 0.52 (0.34–0.70) | 0.38 (0.22–0.53) | 0.73 (0.62–0.85) | NA (a) |
| A3 | 0.56 (0.43–0.69) | 0.48 (0.28–0.68) | 0.45 (0.32–0.59) | 0.57 (0.43–0.70) | NA (a) |
| A4 | 0.59 (0.46–0.72) | 0.53 (0.29–0.77) | 0.63 (0.51–0.74) | 0.46 (0.30–0.62) | 0.25 (−0.14 to 0.67) |
| B1 | 0.33 (0.17–0.49) | 0.67 (0.29–1.04) | 0.31 (0.18–0.44) | 0.48 (0.37–0.59) | NA (a) |
| B2 | 0.43 (0.28–0.58) | 0.27 (0.01–0.54) | 0.26 (0.12–0.40) | 0.69 (0.58–0.79) | NA (a) |
| B3 | 0.53 (0.40–0.66) | 0.53 (0.30–0.75) | 0.53 (0.41–0.65) | 0.49 (0.34–0.64) | NA (a) |
| C1 | 0.67 (0.55–0.79) | 0.50 (0.24–0.76) | 0.49 (0.34–0.63) | 0.71 (0.60–0.83) | 0.67 (0.13–1.20) |
| C2 | 0.70 (0.58–0.82) | 0.44 (0.25–0.63) | 0.61 (0.47–0.76) | 0.81 (0.70–0.92) | NA (a) |

CI, confidence interval.
(a) NA, not applicable, no cases in which the observers decided twice to choose this category.

At T0, after the observers had received insight into clinical parameters, inter-observer agreement was slightly lower within and between most observer categories (Table 3). At T1, a similar decrease in inter-observer agreement was seen. The degree to which insight into clinical parameters changed intra-observer agreement differed between the individual observers (Table 5).

Table 5. Intra-observer agreement [weighted kappa (κw) and proportions of agreement (Pa)] of all observers on clinical management strategy

| Observer | κw, without (a) (95% CI) | κw, with (a) (95% CI) | Pa (b) continue monitoring (95% CI) | Pa (b) FBS (95% CI) | Pa (b) immediate delivery (95% CI) |
| --- | --- | --- | --- | --- | --- |
| A1 | 0.32 (0.00–0.64) | 0.30 (0.00–0.60) | 0.23 (0.00–0.46) | 0.89 (0.83–0.96) | NA (c) |
| A2 | 0.53 (0.34–0.72) | 0.58 (0.40–0.76) | 0.80 (0.72–0.89) | 0.45 (0.28–0.63) | NA (c) |
| A3 | 0.55 (0.40–0.70) | 0.58 (0.43–0.73) | 0.68 (0.57–0.79) | 0.57 (0.43–0.70) | NA (c) |
| A4 | 0.60 (0.45–0.75) | 0.22 (0.01–0.43) | 0.79 (0.69–0.88) | 0.49 (0.33–0.65) | 0.25 (−0.17 to 0.67) |
| B1 | 0.46 (0.30–0.62) | 0.49 (0.34–0.64) | 0.58 (0.45–0.71) | 0.53 (0.41–0.66) | 0.20 (−0.05 to 0.45) |
| B2 | 0.48 (0.31–0.65) | 0.51 (0.35–0.67) | 0.68 (0.57–0.78) | 0.49 (0.35–0.63) | NA (c) |
| B3 | 0.54 (0.40–0.68) | 0.60 (0.45–0.75) | 0.77 (0.67–0.87) | 0.42 (0.26–0.58) | 0.13 (−0.10 to 0.35) |
| C1 | 0.68 (0.54–0.82) | 0.66 (0.52–0.80) | 0.72 (0.61–0.84) | 0.68 (0.55–0.80) | 0.67 (0.27–0.13) |
| C2 | 0.65 (0.47–0.83) | 0.53 (0.34–0.72) | 0.83 (0.74–0.91) | 0.52 (0.34–0.70) | 1.00 (1.00–1.00) |

CI, confidence interval; FBS, fetal blood sampling.
(a) Insight into clinical parameters.
(b) Without insight into clinical parameters.
(c) NA, not applicable, no cases in which the observer decided to choose this category at both T0 and T1.

At T0, there was consensus in 43 CTG patterns: four normal, 12 intermediary and 25 abnormal. In these CTG patterns, the proportion of non-reassuring scalp pH was 0%, 10% and 35%, respectively.

At T1, there was consensus in 33 CTG patterns: four normal, 11 intermediary and 18 abnormal. In these CTG patterns the proportion of non-reassuring scalp pH was 0%, 40% and 23%, respectively.

Discussion

We found considerable variation in the classification of CTG patterns. Observers agreed poorly with each other and fair to good with themselves on CTG classification and clinical management. They achieved the highest agreement on abnormal CTG patterns and on the clinical management option “continue monitoring.”

CTG classification and subsequent clinical management suffer from individual interpretation. The literature is not consistent on the degree of agreement in high-risk deliveries, as inter-observer agreement varies from “poor” to “fair to good” [22-24]. This inconsistency may partly depend on the classification system used. In line with the literature [24], we detected poor inter-observer agreement and “fair to good” intra-observer agreement when classifying CTG patterns according to the FIGO/STAN guidelines. This implies that, even with strict guidelines, the agreement on CTG classification between observers is poor, although individuals appear to score CTG patterns more consistently.

The level of experience and type of profession may influence CTG interpretation. Westerhuis et al. [24] showed differences indicative of higher rates of agreement in more experienced and recently educated professionals. We could not confirm this observation, as our study showed that both clinical midwives and registrars agreed better than the obstetricians. One may speculate on possible explanations: (i) registrars are all trained in the same teaching hospital, whereas obstetricians may have different backgrounds; (ii) clinical midwives and registrars are more “bed-side” than obstetricians and have more current experience than obstetricians; (iii) the observer category size differed in the various groups (four obstetricians, three registrars and two clinical midwives), increasing the probability that the fair to good agreement of midwives was due to chance.

CTG classification is used to identify those fetuses at risk for metabolic acidosis. The literature reports that the rate of agreement on clinical management correlates to fetal umbilical cord pH at birth [23]. In line with this, we found that CTG category correlates with scalp pH if several observers agree about the CTG category. However, these results must be interpreted with caution as they are based on a small sample. Therefore it is still unknown whether better agreements are correlated with “true” interpretations and if better agreement will improve neonatal outcome.

Some physicians share the opinion that a CTG can only be assessed adequately with insight into clinical parameters. We did not find literature that investigated the value of additional clinical parameters to CTG interpretation. In our study, the insight into clinical parameters did not influence inter-observer agreement. Possibly, the degree to which observers included the clinical parameters in their CTG assessment was heterogeneous. Furthermore, by providing clinical parameters we added more variables about which observers could disagree.

Our study provides additional insight into CTG interpretation. However, some limitations need to be addressed. First, we only included CTG patterns which were formerly clinically classified as indicative for FBS. This may have led to a relatively low number of normal and preterminal CTG patterns. On the other hand, this selection meant that, with a reasonable total number of CTG patterns, sufficient patterns were available to draw conclusions about the indication to perform FBS [19]. Additionally, Westerhuis et al. [24] showed good inter-observer agreement on CTGs classified as normal or (pre)terminal, which makes this category of CTG patterns less interesting. Further, all observers were provided with the information that the CTG patterns were formerly diagnosed as non-reassuring and indicative for FBS. This may have influenced observers' assessments, for example by causing them to choose the clinical management option “FBS” more often. However, by providing this information we ensured all observers assessed the CTG patterns with the same foreknowledge.

Another limitation is that only two midwives participated in our study. In an optimal setting we would have wanted at least three observers in each observer group to achieve a narrow enough confidence interval around Pa when using 97 CTG patterns. This means the results in the midwives group are more sensitive to observer bias and should therefore be interpreted with care.

Finally, observers assessed the CTG patterns using the FIGO/STAN guidelines, whereas the CTG patterns stem from a period of time in which the FIGO/STAN guidelines were not used at the hospital. Probably, the indications to perform FBS were more liberal before the guidelines were put in place.

Our study showed that interpretation, i.e. classification of CTG patterns, remains difficult even when using the FIGO/STAN guidelines. This has clinical implications regarding the use of FBS and fetal monitoring with CTG. Since the positive predictive value of FBS is expected to be low if the a priori chance for a fetus suffering from acidosis is low, we think the indication to perform FBS should be more strictly described to ensure the use of FBS is as efficient as possible. Moreover, physicians should be aware of the fact that there is a relatively high chance their colleagues might classify a CTG pattern differently or choose another clinical management strategy, underlining the importance of critical discussion between colleagues. There is a need for a stricter and better implemented CTG classification system, which may help to increase the rate of agreement on the classification and clinical management of CTG patterns.

Acknowledgements

We thank Bas Timmermans for developing the web-based tool for the CTG assessment.

Funding

There was no special funding in this study.
