To evaluate intra- and interobserver agreement of two routinely performed measurements, crown–rump length (CRL) and mean gestational sac diameter (MSD), used to assess the likelihood of miscarriage in the first trimester of pregnancy with transvaginal sonography.
A cross-sectional study of CRL and gestational sac measurements in first-trimester pregnancies was conducted in a fetal medicine referral center with a predominantly Caucasian population. Gestational age ranged from 6 to 9 weeks. All patients underwent a transvaginal ultrasound examination using a high-resolution ultrasound machine. Each of two observers obtained two measurements of CRL and measured three diameters of the gestational sac. Agreement within and between observers for CRL and between observers for MSD was analyzed using 95% prediction intervals, Bland–Altman plots with 95% limits of agreement and the intraclass correlation coefficient (ICC).
In total 54 patients were included in the study, with measurements obtained by both observers in 44 of these. Intra- and interobserver ICCs were high for CRL measurements, with values of 0.992 and 0.993 for intraobserver agreement and 0.993 for interobserver agreement. For the MSD, the interobserver ICC was 0.952. Limits of agreement were ± 8.91 and ± 11.37% for intraobserver agreement of CRL and ± 14.64% for interobserver agreement of CRL. For MSD, the interobserver limits of agreement were ± 18.78%. For an MSD measurement of 20 mm by the first observer, the prediction interval for the second observer was 16.8–24.5 mm. For a CRL measurement of 6 mm, the prediction interval for the second observer was 5.4–6.7 mm.
Threshold measurements for the diagnosis of viability or non-viability of a pregnancy by transvaginal sonography have been described and incorporated into management guidelines [1]. The current guidelines of the Royal College of Obstetricians and Gynaecologists define a missed miscarriage or early fetal demise based on ultrasound criteria as the visualization of an embryo with a length of ≥ 6 mm with no visible cardiac activity [1]. Similarly, an empty sac is defined as one having a mean gestational sac diameter (MSD) of ≥ 20 mm without a visible embryo or yolk sac. Given the significance of the decisions being made on the basis of these measurements, it is essential that both intra- and interobserver variability are taken into consideration.
The reproducibility of first-trimester crown–rump length (CRL) measurements was described in an early report with static image scanners for transabdominal sonography [2], after the Robinson and Hadlock curves for dating had been constructed using similar equipment [3–5]. The reproducibility of fetal biometry in the second and third trimesters of pregnancy with transabdominal 3.5-MHz linear and curvilinear probes has been established [6]. More recently, a good level of reproducibility for measurements taken between 9 and 14 weeks' gestation (CRL, biparietal diameter, abdominal circumference and femur length) has been described using transabdominal ultrasonography [7]. However, ultrasound imaging technology has evolved rapidly over recent years, allowing better delineation of landmarks in the early first trimester. Moreover, as new reference CRL dating curves are constructed with predominantly transvaginal ultrasound, the reproducibility of CRL measurements at earlier gestations has become more relevant [7, 8].
The primary aim of this study was to evaluate agreement in CRL measurements taken by the same observer (intraobserver agreement) and in measurements of CRL and MSD taken by different observers (interobserver agreement) between 6 and 9 weeks' gestation with transvaginal sonography, using modern ultrasound equipment. As a secondary goal, we aimed to investigate the implications for accurately determining gestational age based on CRL before 9 weeks and to show whether the level of reproducibility of these measurements has implications for making a diagnosis of miscarriage based on current cut-off levels.
We conducted a cross-sectional study on the landmarks of single viable embryos at different gestational ages between 6 and 9 weeks, in a referral center for fetal medicine with a predominantly Caucasian population, between 2008 and 2009. Patients presenting with an intrauterine pregnancy at between 6 and 9 weeks were included and were asked to consent to participate in the study once viability had been confirmed; patients were excluded if viability could not be confirmed.
All women underwent an ultrasound assessment using a Voluson E8 (GE Medical Systems, Zipf, Austria) machine with a 6–12-MHz transvaginal transducer for B-mode imaging. The date of the last menstrual period or known date of conception after infertility treatment was recorded. During the same visit patients were scanned consecutively by two observers (Observer A and Observer B), who were blinded to each other's measurements. The measurements of the first observer were always removed from the machine before the second observer entered the examination room.
CRL was measured to the nearest mm from the outer ends of the embryo (greatest length), as described previously [8]. To assess intraobserver agreement of the CRL measurements, the CRL was measured twice by each observer. The probe was moved away from the uterus and the ovaries were examined in between measurements of the fetus. Measurements of the gestational sac were taken in three orthogonal planes from the inner borders of the sac [9], and the MSD was calculated.
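As a minimal illustration of the MSD calculation, the sketch below assumes (as the name of the measurement implies) that the MSD is the arithmetic mean of the three orthogonal sac diameters. The function name and the example values are hypothetical.

```python
from statistics import mean


def mean_sac_diameter(d1_mm, d2_mm, d3_mm):
    """Return the mean gestational sac diameter (mm).

    Assumption: MSD is the arithmetic mean of three diameters measured
    in orthogonal planes from the inner borders of the sac.
    """
    diameters = (d1_mm, d2_mm, d3_mm)
    if any(d <= 0 for d in diameters):
        raise ValueError("diameters must be positive")
    return mean(diameters)
```

For example, hypothetical diameters of 18, 20 and 22 mm give an MSD of 20 mm.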
All ultrasound assessments were carried out by the same two gynecologists with specialist training in obstetric and gynecological sonography. Ethics committee approval was obtained at the University Hospitals Leuven and the patients gave informed consent.
Statistical analysis was performed using SAS 9.2 (SAS Institute Inc., Cary, NC, USA) and Matlab 7.4 (The MathWorks Inc., Natick, MA, USA).
Scatterplots of paired sets of measurements were created with the line of equality plotted in order to visualize the data and assess potential systematic biases between the measurements. The paired t-test with an alpha level of 0.05 was used to check for systematic bias by testing whether the average difference between measurements was different from zero. If so, this would indicate that one set of measurements was on average higher than the other.
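The bias check described above can be sketched as follows. This is an illustrative reimplementation of the paired t statistic using only the Python standard library, not the SAS or Matlab code used in the study; the function name is hypothetical, and the critical value must be taken from a t table or a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """Return (mean difference, t statistic, degrees of freedom)
    for paired measurements of the same patients."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    diffs = [a - b for a, b in zip(first, second)]
    sd = stdev(diffs)  # sample SD of the paired differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    mean_diff = mean(diffs)
    t = mean_diff / (sd / sqrt(len(diffs)))
    return mean_diff, t, len(diffs) - 1
```

Systematic bias is suggested when the absolute value of t exceeds the two-sided critical value of the t distribution at an alpha level of 0.05 (about 2.02 for 43 degrees of freedom, i.e. 44 pairs).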
When there was no systematic bias, 95% prediction intervals for the measurement of a second observer given the measurement of the first were obtained as follows. Both measurements were randomly permuted to avoid one of the observers being systematically chosen as the first observer. The difference between the two measurements divided by their mean (i.e. the proportional difference, expressed in %) was then regressed against the first measurement, and the 95% prediction intervals of this regression were back-transformed to obtain 95% prediction intervals for the second measurement with respect to the first. This procedure was repeated 1000 times, and the resulting prediction intervals were averaged to obtain the final prediction intervals. For CRL, this analysis was performed using only the first measurement obtained by each observer.
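The permutation procedure can be sketched roughly as below. This is one reading of the description, not the study's code, and it rests on several assumptions: the "permutation" is taken to mean randomly swapping which observer counts as first within each pair; the regression is ordinary least squares; and the normal quantile (about 1.96) replaces the t quantile for simplicity. All function names are hypothetical.

```python
import random
from statistics import NormalDist, mean


def proportional_difference(first, second):
    """Difference divided by the mean of the pair, in %."""
    return 100.0 * (second - first) / ((first + second) / 2.0)


def back_transform(x0, d_pct):
    """Second measurement implied by a proportional difference d_pct
    relative to a first measurement x0 (inverse of the function above)."""
    p = d_pct / 100.0
    return x0 * (2.0 + p) / (2.0 - p)


def prediction_interval(obs_a, obs_b, x0, n_rep=1000, seed=0):
    """Averaged 95% prediction interval for a second observer's value
    given a first observer's value x0."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)  # assumption: normal, not t, quantile
    n = len(obs_a)
    lows, highs = [], []
    for _ in range(n_rep):
        # Randomly decide, pair by pair, which observer is "first".
        pairs = [(a, b) if rng.random() < 0.5 else (b, a)
                 for a, b in zip(obs_a, obs_b)]
        x = [f for f, _ in pairs]
        d = [proportional_difference(f, s) for f, s in pairs]
        # Ordinary least-squares regression of d on the first measurement.
        x_bar, d_bar = mean(x), mean(d)
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        slope = sum((xi - x_bar) * (di - d_bar) for xi, di in zip(x, d)) / sxx
        intercept = d_bar - slope * x_bar
        resid_var = sum((di - intercept - slope * xi) ** 2
                        for xi, di in zip(x, d)) / (n - 2)
        centre = intercept + slope * x0
        half = z * (resid_var * (1 + 1 / n + (x0 - x_bar) ** 2 / sxx)) ** 0.5
        lows.append(back_transform(x0, centre - half))
        highs.append(back_transform(x0, centre + half))
    return mean(lows), mean(highs)
```

Calling `prediction_interval(crl_a, crl_b, x0=6.0)` with the two observers' first CRL measurements (hypothetical variable names) would give an interval comparable in kind to those reported in the Results, although the exact values depend on the assumptions listed above.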
To assess the strength of the absolute agreement within and between observers, the intraclass correlation coefficient (ICC) was used based on a two-way random effects analysis-of-variance model [10]. High agreement is evidenced by a high ICC (close to 1).
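For illustration, a two-way random effects ICC for absolute agreement can be computed from the analysis-of-variance mean squares as sketched below. The text does not state whether the single-measure or average-measure form was used; the sketch assumes the single-measure form, often written ICC(2,1), and the function name is hypothetical.

```python
def icc_two_way_random_single(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    ratings: one row per subject, one column per observer (or repeat).
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 subjects and 2 complete columns")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)               # between-subject mean square
    ms_cols = ss_cols / (k - 1)               # between-observer mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square
    denom = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    if denom == 0:
        raise ValueError("ICC is undefined when there is no variation")
    return (ms_rows - ms_error) / denom
```

For interobserver agreement, each row would hold the measurements of one patient by Observer A and Observer B; for intraobserver agreement, the two repeat measurements of one observer.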
The difference between pairs of measurements was plotted against their mean in a Bland–Altman plot [11, 12]. The average difference between measurements was calculated, together with the 95% limits of agreement (LOAs), which are defined as the average difference (which is assumed to be zero in case of no consistent bias) ± 1.96 SD [12]. The normality assumption and the assumptions of constant mean and variance for the LOAs were checked. For all comparisons, the assumption of constant variance was violated as the magnitude of the difference increased with the mean, so that LOAs based on the raw differences would be invalid. We therefore used the proportional difference rather than the difference itself in the Bland–Altman plots. The lack of agreement between measurements or observers becomes relevant only when the LOAs are wider than what is clinically acceptable.
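The proportional limits of agreement can be sketched as follows. This is a minimal illustration assuming the sample standard deviation and the definition of the proportional difference given in the statistical methods (difference divided by the mean of the pair, in %); the function name is hypothetical.

```python
from statistics import mean, stdev


def proportional_limits_of_agreement(first, second):
    """Return (mean % difference, lower 95% LOA, upper 95% LOA)."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    # Proportional difference: difference divided by the pair mean, in %.
    pct = [100.0 * (a - b) / ((a + b) / 2.0) for a, b in zip(first, second)]
    bias = mean(pct)
    sd = stdev(pct)  # sample SD of the proportional differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

Plotting each proportional difference against the mean of the pair, with horizontal lines at the three returned values, gives the proportional Bland–Altman plot.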
A total of 54 patients who underwent a first-trimester transvaginal ultrasound scan were included in the study. All pregnancies were examined by Observer A, and 44 of these were also examined by Observer B. Interobserver agreement of CRL and MSD was assessed in the subset of 44 patients with measurements performed by both observers. Maternal age ranged from 25 to 41 years. The largest proportion of scans (64%) was performed in women aged 30–35 years. The majority of patients were Caucasian (52/54). There was one patient of Afro-Caribbean origin and one of Asian origin, who were both included in the group of 44 patients in which interobserver variability was investigated. Measurements of all variables were possible in all women, but in one case the MSD data obtained by the second observer could not be retrieved. Gestational age ranged from 40 to 68 days. Descriptive statistics for the CRL and MSD measurements obtained by each observer are presented in Table 1.
Table 1. Descriptive statistics for measurements of crown–rump length (CRL) and mean gestational sac diameter (MSD) obtained by Observer A and Observer B
Mean (range) (mm) for each set of measurements: Observer A first, Observer A second, Observer B first, Observer B second.
Figures 1 and 2 show scatterplots of pairs of measurements and 95% prediction intervals for CRL and MSD within and between observers. All points lie randomly around the line of equality in each graph; therefore, there seems to be no systematic bias. The paired t-test confirmed that the observed bias was not significantly different from 0 for all four comparisons. The data points are close to the line of equality for CRL, which reflects good agreement of the measurements within and between observers, while interobserver agreement for MSD is lower.
Based on the 95% prediction intervals that were generated, given a CRL of 6 mm as measured by one observer, the measurement for the second observer is likely to range from 5.4 to 6.7 mm (Table 2). Similarly, given an MSD of 20 mm as measured by one observer, the measurement for the second observer is likely to range between 16.8 and 24.5 mm (Table 3).
Table 2. Prediction intervals (PI) for crown–rump length (CRL) measurement by second observer given a CRL value measured by first observer
CRL as measured by first observer (mm) | 95% PI for CRL as measured by second observer (mm)
PIs were derived using only the first CRL measurement of each observer.
Table 3. Prediction intervals (PI) for mean gestational sac diameter (MSD) measurement by second observer given an MSD value measured by first observer
MSD as measured by first observer (mm) | 95% PI for MSD as measured by second observer (mm)
Bland–Altman plots with 95% LOAs (± 1.96 SD) for intraobserver agreement of CRL measurements for both observers and interobserver agreement for measurements of CRL and MSD are presented in Figures 3 and 4. All average proportional differences between measurements were close to 0, indicating that there was no systematic bias in the measurements within or between observers (all t-tests were non-significant, as mentioned earlier). The mean intraobserver difference for CRL measurements was − 1.15% and + 0.41% for Observers A and B, respectively, and the mean interobserver difference was + 0.25% for CRL and + 0.93% for MSD. For intraobserver agreement of CRL the 95% LOAs were ± 11.37 and ± 8.91% for Observers A and B, respectively (Figure 3), and they were ± 14.64% for interobserver agreement of CRL (Figure 4a). The interobserver 95% LOAs for measurement of MSD were ± 18.78% (Figure 4b).
Intraobserver ICC for CRL measurement was 0.993 and 0.992 for Observers A and B, respectively, and the interobserver ICC was 0.993. The interobserver ICC for MSD was 0.952.
We have shown that measurements of CRL in early pregnancy before 9 weeks' gestation are associated with high intra- and interobserver ICCs, and that the interobserver agreement for CRL measurements is better than that for MSD measurements. The measurements of MSD that are specifically used when making a diagnosis of miscarriage were found to be less reproducible between two observers when performed in the same gestational age range.
A strength of this study is that it is the first to report on intra- and interobserver agreement of landmarks in the early first trimester of pregnancy between 6 and 9 weeks' gestation with transvaginal ultrasound. Our study was performed with modern ultrasound equipment, whose high image resolution allowed the best possible delineation of the embryo and gestational structures available to date.
However, we acknowledge that this study has limitations. The population studied was relatively small, especially for very early gestations where the CRL is below 10 mm (7 weeks). It appears that the smallest CRL measurements are closest to the line of equality, and that the reproducibility at 6 or 7 weeks could possibly be better than it is later in the early first trimester (8 weeks). At these very early stages intra- and interobserver differences in CRL measurements amount to tenths of millimeters, which would be unlikely to result in a difference of more than 2 days when dating a pregnancy [8]. It seems clear that CRL measurements below 8 weeks are more uniformly made, since the embryo is straight. From 8 weeks onwards, measurement differences are possible owing to the curvature of the embryo. Possible measurement errors then range from ± 2 mm in an embryo of 20 mm to ± 3 mm in an embryo of 30 mm, which corresponds to a difference in dating of 2 or 3 days; this is important when considering possible consequences of inaccurate dating later in pregnancy [8]. However, in relation to making a diagnosis of early miscarriage, even a difference in CRL measurement of as little as 1 mm can have an impact on clinical decision-making.
The issue of intra- and interobserver variability of measurements in the first trimester of pregnancy has not been explored widely in the literature to date. Nevertheless, a recent paper showed the influence of ethnicity and maternal age on the rate of change of CRL and MSD [13], while another study showed that the requirement to re-date an early pregnancy may be predictive of later growth restriction [14]. It seems likely that the accuracy of early CRL measurements will become of increasing importance. In order to improve the accuracy of dating from 5 + 3 weeks onwards, we constructed a new reference CRL size curve based on a large number of pregnancies, which showed meaningful differences of several millimeters from the classically used Robinson and Hadlock curves, especially at very early gestations [8]. This study highlighted the fact that research to investigate the intra- and interobserver variability of CRL measurements using transvaginal ultrasound was necessary.
A key issue regarding the reproducibility of embryo and gestational sac size measurements relates to the definitions used for miscarriage. A miscarriage is defined as an embryo with a CRL ≥ 6 mm with no visible cardiac activity or an MSD of ≥ 20 mm where no yolk sac or embryo can be seen [1]. Accordingly, significant interobserver variability may be associated with a misdiagnosis of miscarriage and lead to inadvertent termination of pregnancy. If we consider the 95% prediction intervals from our data, a measured MSD value of 20 mm by the first observer corresponds to a range of 16.8–24.5 mm for the second observer. This is clinically important. Similarly, for a measured CRL of 6 mm, the second observer's measurement ranges from 5.4 to 6.7 mm.
Our data show that the variability of CRL measurements, both for a single observer and between observers, is smaller than the interobserver variability in MSD measurements. However, although small, these differences may have significant clinical consequences. It is clear that whatever single cut-off value may be used to define a miscarriage, great care must be taken when measurements approach the decision boundary. In general, little harm is associated with repeating a scan at a later date when deciding about the potential viability of a pregnancy. We suggest that, in future, for any proposed cut-off value for CRL or MSD to define miscarriage, possible variations in measurement accuracy be taken into account before miscarriage is diagnosed on the basis of one scan. Hence, in the UK, the MSD cut-off of 20 mm used to define miscarriage would become 25 mm to take into account possible measurement error. In this way the risk of terminating wanted viable embryos should be minimized.
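The arithmetic behind such an adjusted cut-off can be illustrated as follows. The sketch assumes that the 25 mm figure corresponds to the upper 95% prediction limit for a 20 mm MSD (24.5 mm) rounded up to a whole millimeter, which the text suggests but does not state explicitly. Both function names are hypothetical and serve only as an illustration of the reasoning.

```python
import math


def conservative_cutoff(pi_upper_mm):
    """Round the upper prediction limit up to the next whole mm.

    Assumption: the adjusted cut-off is the upper 95% prediction limit
    of the second observer's measurement at the current cut-off.
    """
    return math.ceil(pi_upper_mm)


def in_uncertain_zone(measured_mm, cutoff_mm, pi_upper_mm):
    """True when a measurement meets the current cut-off but lies below
    the adjusted one, so that interobserver variation alone could
    explain the result."""
    return cutoff_mm <= measured_mm < conservative_cutoff(pi_upper_mm)
```

With the reported interval for MSD, `conservative_cutoff(24.5)` returns 25, and a hypothetical MSD of 22 mm falls in the zone where a repeat scan would be considered.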
J.L. is a postdoctoral researcher from the Research Foundation—Flanders (FWO), B.V.C. is a postdoctoral researcher from the Research Foundation—Flanders (FWO). T.B. is supported by the Imperial Healthcare NHS Trust NIHR Biomedical Research Centre. Research supported by Research Council KUL: GOA Ambiorics, GOA MaNet, CoE EF/05/006 Optimization in Engineering (OPTEC); Research Foundation—Flanders: FWO G.0302.07 (SVM), G.0341.07 (Data fusion); IWT: TBM070706-IOTA3; Belgian Federal Science Policy Office IUAP P6/04 (DYSCO, ‘Dynamical systems, control and optimization’, 2007-2011).