The range and use of ultrasound fetal measurements have gradually been extended. Measurements have been combined to estimate fetal weight by mathematically based non-linear regression analysis or physically based volumetric methods. Fetal weight estimation is inaccurate, with poor sensitivity for prediction of fetal compromise. Several authors have shown the unacceptable level of intra- and interobserver variability in fetal measurement and the impact of errors on growth assessment. The aims of this study were to review the available methods and possible sources of inaccuracy.
Four databases were searched for studies comparing ultrasound estimated fetal weight (EFW) with birth weight. Studies meeting the inclusion criteria evaluated 11 different methods. Errors were graphically summarized.
No consistently superior method has emerged. Volumetric methods provide some theoretical advantages. Random errors are large and must be reduced if clinical errors are to be avoided.
Many researchers have attempted to estimate fetal weight using single or combined ultrasound measurements of the fetus. Knowledge of expected birth weight is attractive to clinicians as it is an important variable affecting perinatal mortality1. Fetal weight estimation is thought to be helpful in predicting fetal survival and making management decisions in the very low birth weight group (< 1000 g)2 and in managing the delivery of the large baby, where complications may occur3.
The most successful early approach was a simple correlation between abdominal circumference (AC) and birth weight4. Numerous further attempts have combined measurements in regression equations or volumetric formulae, with varying degrees of accuracy. Several of these methods have insignificant systematic errors, but random errors (as measured by the standard deviation of errors) of less than about 7% are rarely reported.
The aim of this study was to review the literature relating to methods for the estimation of fetal weight and the possible sources of inaccuracy in estimated fetal weight (EFW).
The literature review followed the methods detailed in the Cochrane Reviewers' Handbook5. The Handbook recommends that more than one reviewer be involved; this review was performed by a single reviewer, which may be a limitation. The primary aim of the review was to summarize the relevant available evidence on methods for fetal weight estimation without imposing the values and preferences of the reviewer.
Studies where either selected or unselected groups were recruited were included in the review on condition that the study group was defined in terms of selection criteria, scan-to-delivery interval and birth weight distributions.
Study designs and comparisons
The included studies compared ultrasound EFW with birth weight to determine the accuracy of EFW. Randomization was not expected, as the studies were based on direct comparison of ultrasound measurements of the fetus with the weight of the neonate. The primary measures of EFW accuracy compared were the mean percentage error and the SD of percentage errors; some authors reported errors in grams.
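The two accuracy measures can be sketched as follows. The sketch assumes the usual sign convention that percentage error = (EFW − birth weight) / birth weight × 100, so positive values indicate overestimation. The function names and the paired weights are hypothetical and serve only as an illustration.

```python
from statistics import mean, stdev

def percentage_errors(efw_g, birth_weight_g):
    """Signed percentage error of each estimate relative to birth weight.

    Positive values mean the ultrasound estimate exceeded the actual birth weight.
    """
    if len(efw_g) != len(birth_weight_g):
        raise ValueError("each estimate needs a matching birth weight")
    return [100.0 * (e - bw) / bw for e, bw in zip(efw_g, birth_weight_g)]

def summarize(errors):
    """Mean (systematic error) and sample SD (random error) of percentage errors."""
    return mean(errors), stdev(errors)

# Hypothetical paired values, for illustration only (not data from any study).
efw = [3100, 2750, 3600, 1980, 4050]
bw = [3250, 2700, 3900, 2100, 3800]
errs = percentage_errors(efw, bw)
m, sd = summarize(errs)
print(f"mean error {m:.1f}%, SD {sd:.1f}%")
```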
Locating and selecting studies
Four databases, MEDLINE, EMBASE, ZETOC and The Cochrane Library, were searched and reference lists from selected papers were reviewed for further relevant studies. The initial search was broad, focusing only on the term ‘fetal weight’. Studies were excluded where it was clear from the title that they were either not relevant or not valid, e.g. animal studies and pharmaceutical studies. Abstracts of the remaining studies were then reviewed for relevance and validity and further studies excluded. Where abstracts contained insufficient information to assess relevance and validity or to allow recording of the required review data, full papers were obtained. Non-English language literature was excluded at this stage, unless English abstracts provided sufficient information.
Studies were excluded on the grounds of validity where bias had been introduced, usually by selection, e.g. where fetal weight estimates were performed more than 7 days prior to delivery and no indication of the mean and range of times to delivery was given.
Only studies using EFW based on widely used fetal measurements were included, i.e. biparietal diameter (BPD), femur length (FL), head circumference (HC) and AC.
Data collection and analysis
Data were collected using data collection forms, piloted and refined on a number of studies. Data were stored and analyzed in a standard spreadsheet package (Microsoft Excel) and were inspected for patterns and similarities before being grouped for presentation in summary form. Data were presented in two ways, as follows, to address each aim outlined previously: (1) graphically, to show the accuracy of available methods for the estimation of fetal weight and (2) descriptively, to summarize the possible sources of errors in EFW that have been assessed in the studies reviewed.
The estimation methods of 11 research groups were included in the review; a range of formulae for EFW have been developed, based on single or multiple fetal measurements. The originating groups and other independent groups have prospectively compared the performance of these formulae, using complete cross-sections of the clinical population, sometimes stratified into weight bands, and two high-risk populations—preterm or small-for-gestational age fetuses and macrosomic fetuses. Results from the patient cohort from which formulae were developed were excluded.
Two otherwise valid studies were excluded because the time between EFW and delivery was more than 7 days and no mean time to delivery was reported6, 7. In these studies, fetal growth during the interval may have biased EFW towards underestimating birth weight. Several studies were excluded because their results were unsuitable for comparison, e.g. they reported correlation coefficients or mean absolute errors rather than mean errors. Two further studies8, 9 were excluded because they had applied a correction for fetal growth between weight estimation and delivery (28 g/day and 25 g/day); this would introduce a systematic difference in comparison with other studies.
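Here is a minimal sketch of why such a correction shifts the mean error. It assumes the correction adds a constant daily gain multiplied by the scan-to-delivery interval; the exact procedure used in those studies is not described here, and the interval and weights below are hypothetical. Because the added mass is fixed in grams, the shift in percentage error is larger for smaller babies.

```python
def growth_correction_shift_pct(interval_days, grams_per_day, birth_weight_g):
    """Increase in percentage error produced by adding interval_days * grams_per_day
    to an estimate, expressed relative to birth weight."""
    added_g = interval_days * grams_per_day
    return 100.0 * added_g / birth_weight_g

# Hypothetical 7-day scan-to-delivery interval at the two quoted rates.
for rate in (25, 28):
    for bw in (1000, 3000):
        shift = growth_correction_shift_pct(7, rate, bw)
        print(f"{rate} g/day, birth weight {bw} g: +{shift:.1f}% shift")
```

For a 3000 g baby and 25 g/day over 7 days, the correction adds 175 g. That moves the mean error by almost 6%, which is comparable to the systematic errors being compared across studies.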
Studies based entirely on Asian cohorts were excluded on the grounds that there may be bias in comparison with multiethnic studies; methods have been developed specifically for these populations10, 11.
Relevant results were recalculated in two studies where data were available. In one case there were mathematical or transcription errors in a table where results were otherwise consistent12. In another study it appeared that an incorrect formula had been used for EFW but individual data had been tabulated allowing recalculation13. It is of note that data were available in these papers to detect and correct the mathematical errors; many papers reported only limited summary data.
Where studies consistently demonstrated systematic errors greater than 5% for a particular method, this method was excluded from further analysis14–16.
Results of the included studies have been reported in two main formats. Most authors have calculated percentage errors in individual weight estimates with respect to birth weight before calculating the mean and SD of errors. Three studies where errors were calculated as a percentage of EFW, rather than birth weight, were excluded17–19 as comparison with other studies was not possible. Some authors have expressed the mean and SD in grams. Where authors reported the standard error of the mean, or CIs, these were converted to SD for ease of comparison.
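The standard conversions are sketched below, as set out for example in the Cochrane Handbook. That these exact formulas were the ones used in this review is an assumption. The first conversion is SD = SE × √n. The second recovers the SD from a confidence interval for the mean using a normal quantile; for small samples, a t quantile with n − 1 degrees of freedom is more appropriate.

```python
from math import sqrt
from statistics import NormalDist

def sd_from_se(se, n):
    """SD of individual errors from the standard error of their mean."""
    return se * sqrt(n)

def sd_from_ci(lower, upper, n, level=0.95):
    """SD from a confidence interval for the mean (normal approximation).

    For small samples, replace z with a t quantile on n - 1 degrees of freedom.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return sqrt(n) * (upper - lower) / (2 * z)

# Hypothetical report: 100 estimates, SE of the mean percentage error 0.6%.
print(sd_from_se(0.6, 100))
print(sd_from_ci(-2.376, -0.024, 100))
```

Both hypothetical reports correspond to an SD of about 6%. This is why studies quoting SE or CIs can be placed on the same scale as those quoting SD.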
For comparison of performance in the general population and in risk groups, data were categorized into three groups. Where patients were drawn from the full range of birth weight encountered in routine practice, these were included in the ‘normal clinical populations’ category. Studies or subgroups where errors in EFW were reported for smaller or larger birth weight were included in the ‘low birth weight’ and ‘high birth weight’ categories, respectively. Included studies used different definitions for low birth weight groups, ranging from ‘less than 1000 g’ to ‘small-for-gestational age’, the latter having a mean and range of birth weight of 2420 g and 478–3216 g, respectively. High birth weight was defined as birth weight greater than 3999 g, with the exception of one study where birth weight was greater than 2500 g20.
The methods and studies included in the review are listed in Figures 1 to 3.
Normal clinical populations
Figure 1 shows the mean and SDs of error in EFW from the studies that included a broad range of birth weight.
Comparison of the performance of the methods shows those of Hadlock et al.21, 22 to provide generally more consistent mean (systematic) errors across the selected studies, with comparable random errors (SD). The results of Sabbagha et al.23 provide an exception with a mean error of −6.7% for a three-parameter (AC, FL, HC) method. The volumetric method of Combs et al.24, with mean and SD of 0.1% and 9.1%, respectively, in a single study55, merits further evaluation.
The remaining methods are less consistent between studies. There is no clear relationship between study size and errors.
Low birth weight
Figure 2 shows the mean and SDs of error in EFW from the studies including low birth weight groups.
Comparison of the performance of the methods shows a wide variation of systematic and random errors. The method of Sabbagha et al.23 has the smallest systematic errors (1.7% and 2.8%) and SDs (8.1% and 9.1%), as reported in two studies46, 70.
High birth weight
Figure 3 shows the mean and SDs of error in EFW from the studies including high birth weight groups.
Comparison of the performance of the methods shows a general tendency to underestimate fetal weight. It is difficult to draw any firm conclusions from these data.
Using three-dimensional (3D) ultrasound it is now possible to generate volume datasets of the fetus. The field of view of 3D ultrasound is currently limited so that in the second and third trimesters it is not possible to image the whole fetus in one dataset. There is prior evidence, however, that fetal limb cross-sectional measurements and volumes may be valuable in estimating fetal weight25, 26. Several groups have developed formulae relating the volume of one or more fetal body parts to fetal weight27–29; all achieved random errors (SD) of approximately 6–7%, a marginal improvement on the cross-sectional methods. 3D ultrasound is time consuming and is not widely available. If further developments allow faster and more accurate weight estimation, demand for the technology will increase.
Factors affecting the accuracy of estimated fetal weight
In the management of delivery, EFW must be accurate to within a few percent in all cases. Thresholds may be used to determine the mode of delivery; where errors often exceed 10% these thresholds have no value.
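If percentage errors are assumed to be approximately normally distributed, the share of estimates that would miss a ±10% threshold follows directly from the mean and SD. This is an idealized sketch; real error distributions may be skewed or heavy-tailed.

```python
from statistics import NormalDist

def fraction_outside(threshold_pct, mean_pct, sd_pct):
    """Expected fraction of estimates whose percentage error lies outside
    +/- threshold_pct, assuming normally distributed errors."""
    dist = NormalDist(mean_pct, sd_pct)
    return dist.cdf(-threshold_pct) + (1 - dist.cdf(threshold_pct))

for sd in (5, 7.5, 10):
    print(f"SD {sd}%: {fraction_outside(10, 0, sd):.1%} of estimates outside +/-10%")
```

Take the case of no systematic error and an SD of 7.5%, which lies within the 7–8% range quoted for experienced hands. About 18% of estimates would then lie more than 10% from birth weight, and the 95% limits (±1.96 SD) would span roughly ±15%.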
In the prediction of low birth weight for the purposes of intervention well in advance of delivery, high sensitivity is of primary importance where further tests of fetal well-being, such as Doppler ultrasound30, are available. Reducing random errors will increase specificity and make diagnosis more cost-effective.
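A small Monte Carlo sketch can illustrate the link between random error and specificity. Every element of the model is a hypothetical assumption chosen only for illustration: the birth weight distribution (mean 3400 g, SD 550 g), the 2500 g threshold and the unbiased normal error model.

```python
import random

def screening_performance(error_sd_pct, threshold_g=2500, n=100_000, seed=1):
    """Return (sensitivity, specificity) of 'EFW below threshold' for detecting
    birth weight below threshold, under a hypothetical population model."""
    rng = random.Random(seed)  # fixed seed: runs with different SDs see the same fetuses
    tp = fn = tn = fp = 0
    for _ in range(n):
        bw = rng.gauss(3400, 550)  # hypothetical birth weight distribution (g)
        efw = bw * (1 + rng.gauss(0, error_sd_pct) / 100)  # unbiased percentage error
        flagged = efw < threshold_g
        if bw < threshold_g:
            if flagged:
                tp += 1
            else:
                fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

for sd in (10, 5):
    sens, spec = screening_performance(sd)
    print(f"error SD {sd}%: sensitivity {sens:.3f}, specificity {spec:.3f}")
```

In this model, halving the random error raises specificity, because fewer normally grown fetuses are pushed below the threshold by measurement error.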
It is clear from Figures 1–3 that there is a variation in systematic and random errors between methods and between centers. Figure 4 shows that SD values cluster for each study. This implies that there are local factors influencing random errors, such as the study population, the observers, measurement protocols, the equipment or a combination of these variables. A number of authors have analyzed these factors and questioned the validity of formulae.
Most studies have defined their populations in terms of ethnic groups and birth weight distributions. The studies included above were based on typical Western populations of the local area, with a variable proportion of different ethnic groups. Figures 1–3 suggest that systematic and random errors may be larger for small fetuses and that higher birth weight is underestimated; it follows that birth weight distribution within study groups will affect errors.
There are, therefore, factors in study populations that affect results. However, a successful fetal weight estimation method must be accurate at all weights and in all ethnic groups either by universal validity or by targeting, using different formulae for different groups. There is evidence that methods developed in one population may be appropriate for others31. Targeting formulae for ethnic groups is difficult where groups are mixed and other social factors are involved. Formulae targeted to different sizes have disadvantages if used to monitor growth, as a change in formula between measurements may introduce discontinuities into the reference growth curve. A broad, flat distribution of birth weight, rather than a typical clinical population, may be preferable in derivation groups.
Maternal and gestational factors
Maternal body mass index32, 33, fetal sex34 and multiple pregnancy19 do not appear to have a significant influence on measurement error. There is conflicting evidence regarding the influence of amniotic fluid volume32, 35, 36. One would expect maternal adiposity and amniotic fluid volume to affect the accuracy of individual measurements, as both affect image quality. It is possible that these effects are masked by other, larger sources of error.
Townsend et al.36 included a subjective assessment of image quality in their study of EFW in low birth weight infants. EFW from good quality images had smaller random errors (SD 8.9%) compared to fair (SD 13.6%) and poor (SD 15%) images.
Operator experience is important in producing accurate fetal weight estimates. Predanic et al.37 demonstrated a learning curve in estimating fetal weight: accuracy among residents improved significantly over up to 24 months of training, at which point the best performance was achieved. Even with experience, there are interobserver differences in measurements. Gull et al.38 showed that averaging the results of two examiners reduced the mean absolute error in EFW by approximately 17% (from 6.1% to 5.1%).
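Suppose, as a simplifying assumption, that each examiner's error is a component shared by all examiners plus an independent observer component. Averaging k examiners then shrinks only the independent part, by a factor of √k. For zero-mean normal errors, the mean absolute error is proportional to the SD (about 0.80 SD), so the same proportional reduction applies to it. The split used below is hypothetical. The sketch only shows why averaging two examiners can reduce random error by at most about 29% (1 − 1/√2).

```python
from math import sqrt

def averaged_error_sd(independent_sd, k, shared_sd=0.0):
    """SD of the percentage error after averaging k examiners' estimates.

    shared_sd: error common to all examiners (e.g. formula or fetal factors).
    independent_sd: error that differs between examiners and averages out.
    """
    return sqrt(shared_sd ** 2 + independent_sd ** 2 / k)

# Hypothetical split of a single examiner's random error (percent).
single = averaged_error_sd(3.5, 1, shared_sd=6.5)
paired = averaged_error_sd(3.5, 2, shared_sd=6.5)
print(f"one examiner {single:.2f}%, two examiners {paired:.2f}%")
```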
Chang et al.39 measured intra- and interobserver errors in a series of 40 patients scanned by two experienced sonographers. Each measurement included in the interobserver analysis was an average of three measurements of the same parameter. They reported intraobserver differences (SD) of less than 1 mm for linear measurements and approximately 4 mm for circumferences, leading to intraobserver differences of less than 75 g in EFW. Interobserver differences were less than 2 mm for linear measurements and 6–8 mm for circumferences, leading to interobserver differences of less than 85 g, or approximately 3.5%, in EFW. If the random error (SD) in EFW in experienced hands is in the region of 7–8%, then approximately 50% of this may be attributed to observer variability. The contribution of observer variability will be greater, as will the cumulative random error, where less experienced sonographers or sonographers working under time pressure are involved, or where measurements are not averaged.
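The sketch below shows how a circumference difference of this size propagates into EFW. It perturbs AC in one widely quoted three-parameter equation of Hadlock et al. (1985), with HC, AC and FL in cm: log10(EFW) = 1.326 − 0.00326·AC·FL + 0.0107·HC + 0.0438·AC + 0.158·FL. The coefficients are quoted as an assumption and should be checked against the original paper. This equation is not necessarily the version evaluated in the studies reviewed here, and the measurements are hypothetical.

```python
def hadlock_efw_hc_ac_fl(hc_cm, ac_cm, fl_cm):
    """EFW (g) from HC, AC and FL (cm) using a three-parameter Hadlock equation.
    Coefficients quoted as an assumption; verify against the original paper."""
    log10_efw = (1.326 - 0.00326 * ac_cm * fl_cm + 0.0107 * hc_cm
                 + 0.0438 * ac_cm + 0.158 * fl_cm)
    return 10 ** log10_efw

def efw_shift_from_ac_error(hc_cm, ac_cm, fl_cm, ac_error_cm):
    """Change in EFW (grams, percent) when only AC is mismeasured by ac_error_cm."""
    base = hadlock_efw_hc_ac_fl(hc_cm, ac_cm, fl_cm)
    shifted = hadlock_efw_hc_ac_fl(hc_cm, ac_cm + ac_error_cm, fl_cm)
    return shifted - base, 100 * (shifted - base) / base

# Hypothetical term measurements; 0.4 cm matches the ~4 mm intraobserver AC difference.
base = hadlock_efw_hc_ac_fl(33.0, 34.0, 7.2)
delta_g, delta_pct = efw_shift_from_ac_error(33.0, 34.0, 7.2, 0.4)
print(f"EFW {base:.0f} g changes by {delta_g:.0f} g ({delta_pct:.1f}%)")
```

For these hypothetical values the estimate is about 3220 g, and a 4 mm AC difference moves it by roughly 60 g (about 2%). This is of the same order as the intraobserver EFW differences reported by Chang et al.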
Stetzer et al.40 attempted to remove any observer error associated with ultrasound from the derivation of a formula for EFW. They measured 231 neonates, taking an average of three measurements for each variable, and used 115 subjects to derive a formula, which was then tested on the remaining 116 babies. Unfortunately, systematic and random errors were not reported. However, as 72% of estimates were within ±5% and 96% were within ±10%, the SD of a corresponding normal distribution would be approximately 5%. Removing the errors associated with ultrasound measurement therefore reduced random errors substantially.
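The approximate SD can be recovered by inverting the normal distribution. This assumes zero mean error and normally distributed percentage errors; the study reported neither. If a proportion p of errors lies within ±L%, then SD = L / z, where z is the (1 + p)/2 quantile of the standard normal distribution.

```python
from statistics import NormalDist

def sd_from_proportion_within(limit_pct, proportion):
    """SD of zero-mean normal errors for which `proportion` fall within +/- limit_pct."""
    z = NormalDist().inv_cdf(0.5 + proportion / 2)
    return limit_pct / z

sd_5 = sd_from_proportion_within(5, 0.72)    # 72% within +/-5%
sd_10 = sd_from_proportion_within(10, 0.96)  # 96% within +/-10%
print(f"implied SD: {sd_5:.2f}% and {sd_10:.2f}%")
```

The two reported proportions imply SDs of roughly 4.6% and 4.9%, consistent with the figure of approximately 5%.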
Dudley and Potter41 developed a strategy for improving the quality of fetal measurements. Images of every head and abdominal circumference measurement were collected continuously, and a sequential sample was audited against widely accepted quality criteria. Sonographers were provided with feedback on the number of satisfactory measurements and on the quality criteria not met. Recognition of quality criteria improved and, with coaching, the proportion of images meeting all quality criteria increased.
The audit was extended to a further five centers to determine whether quality varied significantly42. This study established that there was considerable variability in measurement quality between centers and that performance could be improved. There were differences of up to 18 mm between AC measurements made on optimal and suboptimal images on the same patient.
These studies have confirmed the findings of Gull et al.38 and Chang et al.39 that measurements, in particular AC, vary between operators, and they identify failure to recognize quality criteria as one cause.
Protocols and equipment
Since the earliest ultrasound estimations of fetal weight, equipment has changed considerably. In addition to improvements in image quality, measurement systems have developed. In the 1970s and early 1980s, many measurements were made from photographs of the ultrasound image, either using a computer following digitization or using map measurers. Some circumferences were directly traced and others were calculated from two orthogonal diameters. On modern equipment, measurements are made on-screen using electronic calipers; circumferences may be measured by direct tracing or ellipse fitting, or calculated from diameters.
Measurement from photographs is subject to a number of possible errors. The most important factor is calibration to provide a conversion from image scale to fetal scale. If this conversion is correct, the measurement is then subject to the same sources of error as on-screen measurement, such as correct identification of structures and placement of calipers. The most significant influence on measurement differences may be the choice between direct tracing and calculation from diameters (which is equivalent to the ellipse fitting method). There is good evidence that there are systematic differences between the methods43, 44 but on modern equipment the differences may be smaller42.
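The sketch below illustrates calculation from diameters. It compares the common shortcut π(D1 + D2)/2, often written 1.57 × (D1 + D2), with Ramanujan's approximation for the perimeter of an ellipse. Which formula a particular machine uses is equipment-specific and is not stated in the text; the diameters are hypothetical. The shortcut is exact for a circle and underestimates the perimeter of any other ellipse.

```python
from math import pi, sqrt

def circumference_simple(d1, d2):
    """Common shortcut pi * (D1 + D2) / 2; exact only for a circle."""
    return pi * (d1 + d2) / 2

def circumference_ellipse(d1, d2):
    """Ramanujan's approximation to the perimeter of an ellipse with diameters d1, d2."""
    a, b = d1 / 2, d2 / 2  # semi-axes
    return pi * (3 * (a + b) - sqrt((3 * a + b) * (a + 3 * b)))

# Hypothetical orthogonal abdominal diameters in cm.
print(circumference_simple(10, 8), circumference_ellipse(10, 8))
```

For diameters of 10 cm and 8 cm the two results differ by only about 0.3%; the gap grows as the section becomes more eccentric.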
Smulian et al.45 investigated the effect of the three methods of circumference measurement on EFW (using a formula incorporating HC and AC). They showed traced circumferences generated larger systematic and random errors compared to the other two methods; this is to be expected as tracing has a much stronger dependence on the quality of the measurement system and on operator dexterity. In their conclusion they stated that there were larger errors in smaller fetuses, attributing this to the difficulty of manipulating the caliper around the tighter curve; this implies a failure to make appropriate use of image magnification, another important factor in measurement accuracy.
Dudley and Chapman42 confirmed that the choice of circumference measurement method will have an impact on EFW. It is important to note, however, that in this study 22% of ACs were not elliptical. Where AC was measured using both trace and ellipse fitting methods, one or both measurements did not closely adhere to the abdominal outline in 43% of cases. Automation of image measurement may be the solution.
Clearly there are clinically important errors in fetal measurement related to the quality of operator training and performance that can be addressed. There may also be errors due to equipment and, if improvements are to be made, these also require evaluation and reduction or elimination. The properties and operation of the electronic calipers used to make measurements are important; the associated errors require evaluation and careful management.
Robson et al.46 found a systematic overestimation using their own EFW method in a prospective series, compared to the derivation series. The principal investigator performed most of the scans, but the studies were performed in two centers with different equipment. Having excluded causes such as interoperator error and scan-to-delivery interval they suggested that the most probable source of error was a difference in calibration of the equipment.
Dudley and Griffith47 assessed the accuracy of a range of equipment using an open-topped test object with a circular arrangement of targets. They showed highly significant errors (15%) in a single ultrasound machine. Further work has shown both errors in other machines (up to 6%) and considerable variation in the reproducibility of measurements (coefficients of variation of 0.2% to 0.9% over 10 repeated measurements)48.
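The two quantities can be sketched for repeated caliper measurements of a single known distance on a test object. This simplifies the circular target arrangement described, and the readings are hypothetical. The percentage bias of the mean reflects calibration; the coefficient of variation (SD / mean) reflects reproducibility.

```python
from statistics import mean, stdev

def calibration_summary(measured_mm, true_mm):
    """Percentage bias of the mean measurement and coefficient of variation (%)
    for repeated caliper measurements of a known test-object distance."""
    m = mean(measured_mm)
    bias_pct = 100 * (m - true_mm) / true_mm
    cv_pct = 100 * stdev(measured_mm) / m
    return bias_pct, cv_pct

# Ten hypothetical readings of a 100 mm target separation.
readings = [101.2, 100.8, 101.5, 100.9, 101.1, 101.3, 100.7, 101.0, 101.4, 101.1]
bias, cv = calibration_summary(readings, 100.0)
print(f"bias {bias:.2f}%, CV {cv:.2f}%")
```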
Validity of formulae
A number of authors have questioned the validity of EFW formulae on two grounds: variability in fetal body composition and proportions, and the mathematical and physical soundness of the formulae themselves.
Bernstein and Catalano49 suggested that the relatively low density of fat leads to overestimation of fetal weight in diabetic mothers, supported by data showing that 22 neonates where weight was overestimated by more than 10% had greater skinfold thickness, ponderal index and body fat than eight neonates where weight was underestimated by more than 10%. This assertion is not supported by available data on the density of fat. If the average fetal density is 1.07 g/mL50 and the density of fat is 0.97 g/mL51, a change in fat mass by 5% of body mass gives an overall density change of only 0.5%. It may be that measurements of AC, HC and FL do not account for increased soft tissue mass in the limbs of the larger fetus, which would lead to underestimates of weight as shown in Figure 3, but contrary to the findings of Bernstein and Catalano49. The incorporation of soft tissue measurements52 has had little success. 3D ultrasound measurements of upper arm and thigh volume may provide improvements in the future53.
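The density arithmetic can be checked with a simple two-component mixture model. As a simplification, assume that 5% of body mass at average fetal density is replaced by fat. Because a volumetric estimate multiplies volume by an assumed density, the resulting weight error is of the same order as the density change.

```python
def mixture_density(base_density, fat_density, fat_mass_fraction):
    """Density (g/mL) of a two-component body: a mass fraction of fat plus the
    remainder at base_density. Volumes add, so 1/rho = sum(mass fraction / density)."""
    specific_volume = ((1 - fat_mass_fraction) / base_density
                       + fat_mass_fraction / fat_density)
    return 1 / specific_volume

FETAL_DENSITY = 1.07  # g/mL, value quoted in the text
FAT_DENSITY = 0.97    # g/mL, value quoted in the text

rho = mixture_density(FETAL_DENSITY, FAT_DENSITY, 0.05)
change_pct = 100 * (FETAL_DENSITY - rho) / FETAL_DENSITY
print(f"density {rho:.4f} g/mL, change {change_pct:.2f}%")
```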
Fetal proportions change throughout pregnancy and in intrauterine growth restriction. Regression equations may not take adequate account of this. A number of authors have proposed volumetric measurements51 or formulae to overcome this possible limitation. Dudley et al.54 developed a formula based on abdominal area, head area and FL and later adapted it to use AC and HC55 but it has not been evaluated in studies meeting the inclusion criteria for this review. Shinozuka et al.56 produced a formula based on neonatal specific gravity and volume but did not present a full analysis of errors. Combs et al.24 developed a similar method but did not present random errors. Dudley55 showed the latter method to produce similar results to the methods of Hadlock et al.22.
The use of volume as the basis for weight estimation has been validated using magnetic resonance imaging (MRI). Baker et al.57 reported a systematic error of 0.4% and an SD of 5.1% (reduced to 0.04% and 4.9% if their regression line is forced through the origin). Uotila et al.50 reported a better correlation between MRI EFW and birth weight (r = 0.95) than between ultrasound EFW (Hadlock et al.22) and birth weight (r = 0.77).
Jackson et al.58 questioned the regression equation approach on the grounds of the confounding effects of skewness, kurtosis, outliers and the repeated use of variables on the outcome of regression. They provided an alternative, volumetric equation, with smaller mean absolute errors (7.2% cf. 7.9%) than the method of Hadlock et al.22.
Chuang et al.59 used an artificial neural network, including gestational age and presentation as variables in addition to ultrasound measurements, to reduce mean absolute errors from 7.5% by the method of Hadlock et al.22 to 6.2%. This is not a clinically significant improvement.
Impact of accuracy on the development of interventions
The main aim of antenatal care is the prevention of morbidity and mortality. Prevention requires intervention in those patients at risk. If identification of the risk group lacks sensitivity and specificity, any trial of intervention will be compromised. There are examples of trials being limited by the inaccuracy of fetal weight estimation. Wallace et al.60 curtailed a trial of delivery route for the low birth weight infant as numerous subjects were incorrectly entered into the trial on the basis of an EFW in the range 750–1500 g. In a systematic review of randomized trials for management of delivery in suspected macrosomia, Irion and Boulvain61 concluded that the inaccuracy of EFW was a limitation of a policy for induction for suspected macrosomia.
Any trial where subjects are selected on the basis of EFW will have serious limitations owing to the accuracy, sensitivity and specificity of the technique. This is a major obstacle to progress in the prevention of adverse outcomes.
No preferred method for the ultrasound estimation of fetal weight has emerged from this review. The size of the random errors remains a major obstacle to confident use in clinical practice, with 95% limits of error (approximately ±2 SD) exceeding ±14% of birth weight in all studies.
Population differences, maternal factors and variations in fetal composition are probably minor issues in the context of the current large random errors in EFW. Image quality is an issue that may be overcome, at least in part, by technological developments such as harmonic imaging. The use of volume to estimate fetal weight has been validated using MRI. Volumetric equations may have advantages in taking account of varying fetal proportions and avoiding the mathematical pitfalls of regression equations.
Measurement methods and observer variability make a major contribution to systematic and random error; standardization and excellent training are necessary.
Efforts must be made to minimize this variability if EFW is to be clinically useful. This may be achieved through averaging of multiple measurements, improvements in image quality, uniform calibration of equipment, careful design and refinement of measurement methods, acknowledgment that there is a long learning curve, and regular audit of measurement quality. Further work to improve the universal validity and accuracy of fetal weight estimation formulae is also required.