Intra- and interobserver repeatability of femur length measurement in early pregnancy




To assess the intra- and interobserver reproducibility of sonographic measurement of fetal femur length between 10 and 16 weeks of gestation.


Femur length was measured three times by the same trained observer in each fetus of 136 pregnant women. A second trained observer then repeated the measurements. The coefficient of variation, intraclass correlation coefficient and repeatability coefficient with 95% CIs were calculated for each observer and between the two observers.


The inter- and intraobserver repeatabilities of femur length were good. For interobserver correlation, the coefficient of variation was 4.6% (95% CI, 3.0–6.2), the intraclass correlation coefficient was 0.82 (95% CI, 0.69–0.95) and the repeatability coefficient was 2.1 (95% CI, 1.8–2.7). For intraobserver correlation, the coefficient of variation was 4.2% (95% CI, 3.2–5.6), the intraclass correlation coefficient was 0.91 (95% CI, 0.75–0.97) and the repeatability coefficient was 3.23 (95% CI, 2.33–3.86) for Observer 2. Similar results were obtained for the other observer.


Transvaginal femur length measurement is technically feasible and easy to perform between 10 and 16 weeks of gestation. The high degree of intra- and interobserver repeatability indicates it to be a reproducible method. Copyright © 2004 ISUOG. Published by John Wiley & Sons, Ltd.


Femur length (FL) measurement has been reported as a sonographic marker in screening for trisomy 21 with a detection rate varying from 10 to 70%1, 2. This variation can be accounted for by several factors: gestational age at ultrasound scan, differences in study design, and a possible discrepancy related to methodological and technical differences in the way in which the measurement was performed by various observers1–8. The discriminatory ability of a diagnostic test depends not only on the technical capacity of the test to identify the risk of pathology but also on the repeatability of the test. Consequently, poor repeatability limits the clinical value of the test.

Differences in fetal measurements can arise from real differences between fetuses, real changes within a fetus, and differences between the real and observed values due to inaccuracy in the measurement procedure. The latter can be caused in part by the observer, mainly because an observer can fail to repeat his/her own measurements or because there might be disagreement between different observers.

The aim of our study was to assess the repeatability of transvaginal sonographic measurement of FL performed in early pregnancy.


Our study population consisted of 136 consecutive women with singleton pregnancies at 10–16 completed weeks of gestation, attending our prenatal unit to undergo genetic amniocentesis in the period February 1995 to August 1999. The mean (±SD) maternal age was 36.6 ± 2.8 (range, 23–46) years and the gestational age at the time of the transvaginal scan was 13.3 ± 0.6 weeks. Gestational age was determined from the date of the last menstrual period. Exclusion criteria included: irregular menstrual cycle, last pregnancy or intake of contraceptive pill less than 3 months previously, stillbirth, fetal malformations detected either by transvaginal scan in early pregnancy, by the transabdominal route later in pregnancy, or at delivery, abnormal fetal karyotype, and incomplete FL data. All women consented to participate in the study.

Genetic amniocentesis was performed in all cases at 15–18 weeks' gestation. Transvaginal ultrasound examinations were performed on real-time linear array ultrasound equipment (Sonolayer SSA–270A and SSA–340A, Toshiba Corporation, Tokyo, Japan), with a 5.0–7.0-MHz convex probe with a sweep angle of 86° and 121°, respectively. The FL was measured from a sagittal section of the bone along its long axis from outer to outer margin, as previously described9, 10.

The measurements were expressed in decimals of millimeters and were performed by two experienced sonographers (P.R., Observer 1 and L.G., Observer 2). Each observer attempted to obtain three measurements with different time intervals between each. The sonographers were blinded to their own measurements by covering the numeric display. In addition, they were not allowed to watch each other performing the measurement to remove any possible influence on the second observer.

Statistical analysis

To assess intraobserver repeatability, three independent bone measurements were made in each fetus by each observer. To assess interobserver agreement, the two investigators examined every fetus in arbitrary order. The SD obtained expresses the total variance of the measurement11. The Shapiro–Francia test was used to assess the distribution of FL measurements12 and all the measurements fitted a normal distribution. We calculated the coefficient of variation (CV), which, expressed as a percentage, is the ratio of the SD to the overall mean13. In all calculations, one measurement value for each case was used, i.e. the mean of the three replicates for each case evaluated by each observer. We also calculated the repeatability coefficient and the intraclass correlation coefficient (ICC). The ICC is the measure of concordance when a variable is continuous and corrects correlation for systematic bias14. It is a statistic that describes the reproducibility of repeated measures in the same population and indicates true variance as a fraction of the total variance. Two separate models were assessed to calculate the intra- and interobserver correlation coefficients. For these models, we calculated the total sum of squares (SST) and the sum of squares between observers (SSB) and within observers (SSW), derived with a repeated measure analysis of variance. In the intraobserver model SST was given by SSB and SSW; in the interobserver model SST was given by SSB, SSW and the sum of squares given by the interaction of observer measurements (SS interaction), as follows: IntraCC = ((k × SSB) − SST)/(SSB + SSW); InterCC = ((k × SSB) − SST)/(SSB + SSW + SS interaction), where k is the number of measurements in each subject. A model with subject and observer effect with evaluation of possible interaction effect was used to calculate the values (i.e. the variances) needed for the formula. This calculation was made using all the repeated measures obtained.
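
As an illustration of how an ICC can be derived from analysis-of-variance sums of squares, the following minimal sketch computes the standard one-way random-effects ICC from replicate measurements. It is not the model used in this study: the function name `icc_oneway`, the one-way formulation and the example values in `fl_mm` are all assumptions made purely for illustration.

```python
from statistics import mean


def icc_oneway(data):
    """One-way random-effects ICC for n subjects with k replicates each.

    `data` is a list of subjects; each subject is a list of k replicate
    measurements. This is the textbook one-way ICC, used here as a
    stand-in; it is not claimed to be the model fitted in the study.
    """
    n = len(data)
    k = len(data[0])
    grand_mean = mean([x for row in data for x in row])
    row_means = [mean(row) for row in data]

    # Sum of squares between subjects and within subjects.
    ssb = k * sum((m - grand_mean) ** 2 for m in row_means)
    ssw = sum((x - m) ** 2 for row, m in zip(data, row_means) for x in row)

    # Mean squares from the one-way analysis of variance.
    msb = ssb / (n - 1)
    msw = ssw / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical femur lengths in mm: four fetuses, three replicates each.
fl_mm = [
    [10.0, 10.2, 10.1],
    [12.0, 12.1, 11.9],
    [14.0, 14.2, 14.1],
    [16.0, 15.9, 16.1],
]
icc_value = icc_oneway(fl_mm)
```

In this hypothetical data set the replicates differ by far less than the fetuses do, so the resulting ICC is close to 1.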
The value of ICC lies between 0 and 1; an ICC of 1 for repeated measurements indicates perfect reproducibility between observers or measurements, while a value of 0 is interpreted as reproducibility that is no better or worse than that expected by chance. We considered repeated measurements with an ICC value in excess of 0.75 as being clinically useful. Repeated measurements with an ICC value of less than 0.7 are probably not useful in a clinical diagnostic evaluation. The repeatability coefficient is the maximum difference that is likely to occur between repeated measurements. It is defined as 1.96 × √(2s_w²), where s_w² is the within-subject variance, and is quoted in terms of the original units of the data without any transformation. Confidence intervals for the ICC, CV and repeatability coefficient were calculated according to the methods of Scheffe15. Stata 7.0™ (Stata Corp., College Station, TX, USA) was used to perform the statistical analyses.
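
The repeatability coefficient defined above can be sketched in a few lines. The example below estimates the within-subject variance from replicate measurements and applies 1.96 × √(2 × variance); it also computes a CV. The function names and data are hypothetical, and taking the within-subject SD as the numerator of the CV is an assumption, since the text does not state exactly which SD enters the ratio.

```python
import math
from statistics import mean


def within_subject_variance(data):
    """Pooled within-subject variance from replicate measurements."""
    row_means = [mean(row) for row in data]
    ssw = sum((x - m) ** 2 for row, m in zip(data, row_means) for x in row)
    dof = sum(len(row) - 1 for row in data)
    return ssw / dof


def repeatability_coefficient(data):
    """1.96 * sqrt(2 * within-subject variance), in the units of the data."""
    return 1.96 * math.sqrt(2 * within_subject_variance(data))


def within_subject_cv(data):
    """Within-subject SD as a percentage of the overall mean (assumed form)."""
    grand_mean = mean([x for row in data for x in row])
    return 100 * math.sqrt(within_subject_variance(data)) / grand_mean


# Hypothetical femur lengths in mm: four fetuses, three replicates each.
fl_mm = [
    [10.0, 10.2, 10.1],
    [12.0, 12.1, 11.9],
    [14.0, 14.2, 14.1],
    [16.0, 15.9, 16.1],
]
rc_mm = repeatability_coefficient(fl_mm)
cv_percent = within_subject_cv(fl_mm)
```

Because the coefficient is expressed in the original units, `rc_mm` can be read directly as the difference in millimeters that two repeated measurements on the same fetus would be expected to exceed only about 5% of the time.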


Intra- and interobserver repeatabilities of FL measurements are shown in Table 1.

Table 1. Intra- and interobserver correlations for transvaginal sonographic measurement of femur length (n = 136)
                 CV, % (95% CI)    ICC (95% CI)        Repeatability coefficient (95% CI)
 Observer 1      5.9 (4.9–7.0)     0.88 (0.69–0.96)    3.35 (2.21–4.01)
 Observer 2      4.2 (3.2–5.6)     0.91 (0.75–0.97)    3.23 (2.33–3.86)
 Interobserver   4.6 (3.0–6.2)     0.82 (0.69–0.95)    2.1 (1.8–2.7)

 Three measurements were made by each observer for each of the 136 fetuses. CV, coefficient of variation; ICC, intraclass correlation coefficient.


FL measurement is technically feasible by transvaginal ultrasound examination from 8 weeks of gestation16. Lower than expected FL measurements can be considered in the second trimester of pregnancy as sonographic markers of fetuses at risk for Down syndrome1, 2, 17, 18. The detection rate using this sonographic marker has varied considerably, probably due not only to differences in study design but also to potential methodological and technical differences in the way in which measurements were performed by various observers. It follows that poor repeatability of the measurement can adversely affect the sensitivity and specificity of the screening test14. Moreover, in early pregnancy FL measurements are very small and errors in the technique could therefore have an even greater impact.

We used various indices and coefficients to assess intra- and interobserver repeatability, since there is no general agreement about the correct statistical approach to use19. For example, there is no consensus in the scientific literature regarding the interpretation of ICC values20 and the widely employed CV does not seem to be sufficient on its own to assess repeatability. Nevertheless, values for InterCC and IntraCC above 0.75 are said to be acceptable14. Since the ICC is highly dependent on the variance in the population and the variances between different study populations will not be the same, the ICC values obtained in a study may not be comparable with those observed in other studies. ICC values should always be considered together with the corresponding CV values: a good ICC should have a correspondingly low CV value.
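
The dependence of the ICC on population variance can be made concrete with a small worked example. The sketch below uses the general definition of the ICC as the true (between-subject) variance divided by the total variance. The numbers are hypothetical and are not taken from this study: the same measurement error is combined with a narrow and a wide spread of true femur lengths.

```python
def icc_from_variances(between_var, within_var):
    """ICC as between-subject variance over total variance."""
    return between_var / (between_var + within_var)


# Hypothetical values: identical measurement error (SD 0.5 mm) in both cases.
within_var = 0.5 ** 2

# Narrow population, e.g. a short gestational age window (true SD 1 mm).
icc_narrow = icc_from_variances(1.0 ** 2, within_var)

# Wide population, e.g. a long gestational age window (true SD 3 mm).
icc_wide = icc_from_variances(3.0 ** 2, within_var)
```

With these assumed values the ICC rises from 0.80 to about 0.97 although the measurement error is unchanged, which is why an ICC is best read alongside a measure such as the CV that reflects the error itself.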

In our study the interobserver correlation for the transvaginal sonographic measurement of FL revealed a CV of 4.6% (95% CI, 3.0–6.2) and an ICC of 0.82 (95% CI, 0.69–0.95). For the intraobserver correlation the CV was 4.2% (95% CI, 3.2–5.6) for Observer 2 and similar results were obtained for Observer 1. The low CV and high ICC revealed good repeatability and reproducibility of this biometric parameter in early pregnancy. Similarly, the repeatability coefficient for the interobserver measurements for the FL was 2.1 (95% CI, 1.8–2.7) and for the intraobserver measurements it was 3.23 (95% CI, 2.33–3.86) for Observer 2 and 3.35 (95% CI, 2.21–4.01) for Observer 1. Thus the error for any particular measurement can be considered to be negligible.

Further studies need to be performed to determine the clinical validity of these measurements in a high-risk patient population, especially in screening for chromosomal abnormalities, particularly Down syndrome. A prerequisite for these studies is the availability of reproducible FL measurements.