How should randomised trials including multiple pregnancies be analysed?


*Dr S. Gates, National Perinatal Epidemiology Unit, Institute of Health Sciences, Old Road, Oxford OX3 7LF, UK.


Objective  To compare the effects of four methods of analysis on the results of randomised controlled trials that recruit women with multiple pregnancies and measure outcomes on their babies.

Design  Analysis of one real and two simulated data sets.

Setting  Secondary analysis of perinatal randomised controlled trials.

Population  Randomised controlled trials including women with multiple pregnancies.

Methods  The analytical methods compared were (a) assuming independence among babies, (b) analysing outcomes per woman, counting a woman as having an outcome if any of her babies had it (equivalent to selecting the worst outcome among any of a woman's babies), (c) randomly selecting one baby from each set of multiples for inclusion in the analysis, and (d) adjusting the analysis to take account of non-independence of babies from multiple pregnancies, using methods developed for the analysis of cluster randomised trials.

Main outcome measures  Odds ratios for trials' main outcomes.

Results  Results from application of cluster trial methods were similar to those from assuming independence among babies, but with slightly wider confidence intervals, reflecting the reduced effective sample size caused by non-independence between babies from the same pregnancy. Results were more variable using the other two methods, and in some cases, departed markedly from the results of the cluster trial methods.

Conclusions  Cluster trial methods provide a simple way of adjusting the analysis to take account of non-independence between babies from the same pregnancy. Analysis by pregnancy and random selection (methods (b) and (c)) have disadvantages and do not report outcomes for all of the babies in the trial. This may cause problems with incorporating trials analysed using these methods into systematic reviews.


Introduction

Many randomised controlled trials in perinatal medicine evaluate treatments that are given to women antenatally or during labour but measure outcomes on their babies. Some treatments given to women are specifically intended to improve outcomes for the baby, but even where the main effect of the treatment is on the woman rather than the baby, data about the baby are often important secondary outcomes.

Antenatal recruitment and neonatal outcome measurement may cause a problem in the analysis when the trial population includes women with multiple pregnancies, because the outcomes of offspring from a multiple pregnancy are not independent. Babies from the same pregnancy are likely to have more similar outcomes than babies from different pregnancies, for several reasons: they are all exposed to the same conditions before birth; they are genetically similar or identical, and hence may react in the same way to interventions; and they may affect each other, so that one baby having a particular outcome may make the others in the same pregnancy more likely to have it.

Inclusion of non-independent data means that the "effective sample size" of the trial is reduced: there are fewer independent outcomes in the trial than the number of babies that took part in it. Analysing all babies as if they are independent will therefore overestimate the sample size and give confidence intervals that are too narrow. The extent to which the effective sample size is reduced will depend on the degree of dependence between babies from multiple pregnancies and the proportion of multiples recruited to the trial. If multiple pregnancies make up only a small percentage of the trial, there is probably little potential for them to affect the results. Some trials, however, recruit a substantial proportion of women with multiple pregnancies. For example, in the Antenatal TRH trial,1 nearly 20% (44/225) of the women recruited had multiple pregnancies, and they contributed 34% (94/275) of the babies in the trial. Even where multiples make up only a relatively small percentage of the trial's population, it is common to analyse single and multiple pregnancies separately in subgroup analyses. In multiple pregnancy subgroups, there is obviously great potential for non-independent data to influence the analysis.

Here we discuss and compare the methods that have been used for analysis of data sets containing multiple pregnancies and suggest applying methods that have been developed for analysis of cluster randomised trials. These adjust the analysis to take account of non-independence. The methods are illustrated using data from the Antenatal TRH trial and two simulated trial data sets.

Probably the most common method of analysis is to ignore non-independence between babies from multiple pregnancies and to assume that each baby is an independent data point. The sample size used in the analysis will therefore be larger than the effective sample size, giving confidence intervals that are too narrow. However, estimates of the risk in each group, and hence the relative risk or odds ratio, are not affected by non-independence. This is because non-independence reduces the effective sample size of both the number of babies with an outcome and the total number by the same amount. For example, if 50/200 babies from 100 twin pregnancies have respiratory distress syndrome, and the effective sample size is half of the total number (i.e. there are 100 independent data points), the estimate of the risk would be 25/100, which is the same proportion (0.25) as 50/200. However, the confidence interval around 25/100 (0.18, 0.34) is wider than that around 50/200 (0.20, 0.31), because of the smaller sample size.
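The arithmetic in this example can be checked with a short sketch. The intervals below use the simple normal (Wald) approximation, so they differ slightly from the exact intervals quoted above, but the pattern is the same: the point estimate is unchanged while the interval widens.

```python
import math

def wald_ci(events, n, z=1.96):
    """Normal-approximation (Wald) 95% confidence interval for a proportion."""
    p = events / n
    se = math.sqrt(p * (1 - p) / n)
    return p, (p - z * se, p + z * se)

# 50/200 babies affected, but if outcomes within the 100 twin pregnancies
# are perfectly correlated there are only 100 independent data points.
p_naive, ci_naive = wald_ci(50, 200)  # treats every baby as independent
p_eff, ci_eff = wald_ci(25, 100)      # uses the effective sample size

print(p_naive, p_eff)    # identical point estimates (0.25)
print(ci_naive, ci_eff)  # the second interval is wider
```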

Assuming independence between babies has the advantages of being easy to apply and including all of the trial's data in the analysis.

An alternative method, used by some perinatal trials,2 is to use the number of women recruited as the denominator, counting a woman as having an outcome if any of her babies has it (analysis by pregnancy). This is equivalent to taking the worst outcome among any of a woman's babies as the outcome for that woman. This method avoids including non-independent data in the analysis, but it disregards part of the data set; where one of a set of multiples has an outcome, the others do not contribute to the analysis. There is therefore a cost in collecting data that are not used.

Analysis by pregnancy will often give a different risk estimate than assuming independence. For example, if one of every pair of twins died, this method would suggest 100% death, compared with 50% if all babies were included in the analysis. It addresses the mothers' risk of having one or more babies that die, rather than the babies' risk of death.
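The two denominators can be made concrete in code. The data layout below is hypothetical, chosen purely to show the counting rule:

```python
# Hypothetical baby-level outcomes, keyed by mother: 1 = outcome present.
pregnancies = {
    "w1": [1, 0],  # twins, one baby affected
    "w2": [0, 0],  # twins, neither affected
    "w3": [1],     # singleton, affected
}

# Per-baby analysis: every baby is a data point.
babies_affected = sum(sum(v) for v in pregnancies.values())
babies_total = sum(len(v) for v in pregnancies.values())

# Per-pregnancy analysis: a woman counts as having the outcome if any
# of her babies had it (the worst outcome among her babies).
women_affected = sum(any(v) for v in pregnancies.values())
women_total = len(pregnancies)

print(babies_affected, babies_total)  # 2 of 5 babies
print(women_affected, women_total)    # 2 of 3 women
```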

A further method that has also been used3 is to select at random one infant from each multiple pregnancy for inclusion in the analysis (random selection). This avoids including non-independent data, but again it excludes some babies from the analysis, and therefore some data are collected but not used. As for analysis by pregnancy, the sample size for this method will be equal to the number of women recruited, as only one baby from each pregnancy is included.

The random selection element in this method means that the result will not be exactly the same if the analysis is repeated. If there is, in reality, no difference between the groups, it is possible that, by chance, random selection will produce a spuriously large difference. Similarly, a real difference may be obscured. Such misleading results will be rare, but may not be detected unless the analysis is repeated several times, which in turn raises the issue of which result should be presented in the trial's publication.

The range of possible results that the random selection can give is constrained by the number of multiple pregnancies that all have the same outcome. For example, if among 20 sets of twins, in 8 sets both babies died, and in the remaining 12 sets, both survived, then random selection of one baby from each pair of twins will always find that 8/20 babies died. However, if the 16 deaths occurred in 16 different twin pregnancies (i.e. in 16 pregnancies, 1 baby died and in 4 neither died), then the random selection could produce anywhere between 0 and 16 deaths from the 20 sets of twins, with the average being 8.
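The twin example above can be reproduced directly. This sketch repeats the random selection many times to show the constrained (concordant) and unconstrained (discordant) cases:

```python
import random

def random_selection_deaths(twin_sets, rng=random):
    """Pick one baby at random from each twin pair and count the deaths."""
    return sum(rng.choice(pair) for pair in twin_sets)

# Case 1: among 20 twin sets, both babies died in 8 and both survived
# in 12 -- random selection always finds 8/20 deaths.
concordant = [(1, 1)] * 8 + [(0, 0)] * 12

# Case 2: the same 16 deaths spread across 16 different twin sets --
# the count can fall anywhere between 0 and 16, averaging 8.
discordant = [(1, 0)] * 16 + [(0, 0)] * 4

print(random_selection_deaths(concordant))  # always 8
results = [random_selection_deaths(discordant) for _ in range(10000)]
print(min(results), max(results))  # spread around a mean of 8
```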

In a trial comparing two groups, there is therefore a range of possible results (odds ratios or relative risks) using this method. There will be an “average” result, when each group has the average number of outcomes, and two extremes, which occur when the maximum possible number of outcomes are selected by chance in one group and the minimum possible in the other, and vice versa.

Further analytical methods have been proposed for taking account of non-independence between data points.4–6 These have dealt with situations, such as ophthalmology, where each individual can contribute one or two eyes to an analysis, or dental data, where several data points may be provided by each subject. These situations, and trials that include multiple pregnancies, are similar to cluster randomised trials, where groups of participants, or “clusters” (e.g. GP surgeries, villages, towns, schools or patients of a single consultant), are randomly allocated to the interventions being compared.7 In perinatal trials, each woman can be regarded as a cluster, with the number of individuals in the cluster being equal to the number of fetuses in her pregnancy. These clusters are much smaller than those in many cluster randomised trials where clusters may include hundreds or even thousands of individuals, but the principle of adjustment of the analyses to take account of similarity between members of the same cluster is exactly the same.

Donner and Klar8 present methods for analysis of cluster randomised trials that can be directly applied to data from trials including multiple pregnancies. The calculations are relatively straightforward, can be carried out on a spreadsheet, and are also implemented in the ACLUSTER software program.9 They include procedures for calculating the odds ratio and its confidence interval, which are the measures most suitable for analysis of randomised controlled trials (see Appendix 1). They also provide adjusted χ2 statistics that take account of clustering, but these are not considered further here.

Because the methods described above can give different estimates of the risk in each group, their estimates of the difference between the groups (measured by relative risk, odds ratio or risk difference) may also differ.


Methods

Three data sets were each analysed by four methods: assuming independence among babies, analysing by pregnancy, random selection and cluster trial methods. For clarity, in the results discussed here, only the odds ratio (and 95% confidence interval) is used. The results of the analyses may be slightly different using the relative risk or risk difference.

The Antenatal TRH trial (Table 1) evaluated antenatal thyrotropin releasing hormone for women at risk of preterm delivery. It recruited 225 women; 44 had a multiple pregnancy, and the total number of babies was 275. The outcome analysed here is the trial's primary outcome: known death before the end of trial data collection or need for supplemental oxygen at 28 days after birth: 34/136 babies in the TRH group and 43/139 babies in the placebo group had this outcome.

Table 1.  Summary of the three data sets used for analyses: the Antenatal TRH trial and two simulated data sets, Trials A and B.
Group | Number of women | Number with zero outcomes | Number with one outcome | Number with two outcomes | Number with three outcomes
Antenatal TRH trial
TRH group
Placebo group
Trial A
Intervention group
Placebo group
Trial B
Intervention group
Placebo group

Two simulated data sets are also analysed (Trials A and B; Table 1). These both contain data for 1000 women, 500 in the intervention group and 500 in the placebo group. In both data sets, 350 women in each group had a single pregnancy, 120 had twins and 30 had triplets. The number of babies in each group is therefore 680. In Trial A, the risk of the outcome is higher in the intervention group; it is also higher for multiple pregnancies, and there is a higher correlation between the outcomes of multiples from the same pregnancy in the intervention group. In Trial B, the overall incidence of the outcome is very similar in both groups, but the distribution differs. The risk is higher for multiples in the intervention group but not the placebo group. These data sets were chosen for illustrative purposes, as they contain features that show differences in the results depending on the methods of analysis.
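One simple way to generate data with this kind of within-pregnancy correlation is sketched below. This is not the mechanism used to construct Trials A and B; the event probabilities and the correlation value here are arbitrary illustrative choices.

```python
import random

def simulate_pregnancy(n_babies, p, rho, rng=random):
    """Binary outcomes for one pregnancy with intracluster correlation rho.

    With probability rho all babies share a single Bernoulli(p) draw;
    otherwise each baby gets an independent draw. This makes the pairwise
    correlation between babies in the same pregnancy exactly rho.
    """
    if rng.random() < rho:
        shared = int(rng.random() < p)
        return [shared] * n_babies
    return [int(rng.random() < p) for _ in range(n_babies)]

# One arm with the same mix as the simulated trials: 350 singletons,
# 120 twin sets and 30 triplet sets (500 women, 680 babies).
rng = random.Random(1)
arm = ([simulate_pregnancy(1, 0.15, 0.3, rng) for _ in range(350)]
       + [simulate_pregnancy(2, 0.25, 0.3, rng) for _ in range(120)]
       + [simulate_pregnancy(3, 0.25, 0.3, rng) for _ in range(30)])
print(len(arm), sum(len(p) for p in arm))  # 500 women, 680 babies
```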

Because of the possible variation in the result from the random selection method, this analysis was repeated 1000 times for each data set, to estimate the distribution of possible results. We also calculated the highest and lowest values of the odds ratio that were theoretically possible. The cluster methods were applied using ACLUSTER software.9


Results

The results of applying the four methods to the three data sets are summarised in Table 2.

Table 2a, b & c.  Comparison of results using different methods applied to the three data sets. The four results for random selection give the expected average result, the mean of 1000 repetitions of the analysis, and the two possible extreme results. Values are given as n/N (%), mean {SD} and OR [95% CI].
(a) Antenatal TRH trial
Method | TRH group | Placebo group | OR [95% CI]
Assume independence between babies | 34/136 (25.0) | 43/139 (30.9) | 0.74 [0.42, 1.31]
Analyse by pregnancy | 29/112 (25.9) | 34/113 (30.1) | 0.81 [0.43, 1.51]
Random selection: expected average result | 27/112 (24.1) | 32.33/113 (28.6) | 0.80 [0.55, 1.52]
Random selection: mean of 1000 repetitions | | | 0.83 {0.04}
Random selection: extremes | 25/112 (22.3) | 34/113 (30.1) | 0.67 [0.48, 1.16]
 | 29/112 (25.9) | 30/113 (26.5) | 0.97 [0.63, 1.51]
Cluster trial methods | 34/136 (25.0) | 43/139 (30.9) | 0.74 [0.41, 1.34]

(b) Trial A
Method | Intervention group | Placebo group | OR [95% CI]
Assume independence between babies | 169/680 (24.9) | 102/680 (15.0) | 1.87 [1.43, 2.46]
Analyse by pregnancy | 135/500 (27.0) | 90/500 (18.0) | 1.68 [1.25, 2.28]
Random selection: expected average result | 115/500 (23.0) | 69/500 (13.8) | 1.87 [1.33, 2.63]
Random selection: mean of 1000 repetitions | | | 1.88 {0.12}
Random selection: extremes | 135/500 (27.0) | 49/500 (9.8) | 3.40 [2.39, 4.85]
 | 95/500 (19.0) | 90/500 (18.0) | 1.07 [0.78, 1.47]
Cluster trial methods | 169/680 (24.9) | 102/680 (15.0) | 1.87 [1.41, 2.49]

(c) Trial B
Method | Intervention group | Placebo group | OR [95% CI]
Assume independence between babies | 138/680 (20.3) | 136/680 (20.0) | 1.02 [0.78, 1.33]
Analyse by pregnancy | 107/500 (21.4) | 128/500 (25.6) | 0.79 [0.59, 1.06]
Random selection: expected average result | 83/500 (16.6) | 100/500 (20.0) | 0.80 [0.58, 1.10]
Random selection: mean of 1000 repetitions | | | 0.80 {0.05}
Random selection: extremes | 107/500 (21.4) | 75/500 (15.0) | 1.54 [1.11, 2.14]
 | 59/500 (11.8) | 128/500 (25.6) | 0.39 [0.28, 0.55]
Cluster trial methods | 138/680 (20.3) | 136/680 (20.0) | 1.02 [0.78, 1.33]

As expected, the results from assuming independence and the cluster trial methods were very similar, with slightly wider confidence intervals when cluster methods were used. The impact of cluster methods would increase with increasing correlation of outcomes within multiple pregnancies, and with an increasing proportion of multiple pregnancies in the trial.

Analysis by pregnancy gave a slightly smaller effect size (i.e. an odds ratio closer to 1) than assuming independence between babies for the Antenatal TRH trial data and Trial A, but a much larger effect for Trial B. This is due to the different distributions of outcomes in the three trials. Trial B has a similar number of outcomes in each group, but in the intervention group, they are distributed among fewer women than in the placebo group. In other words, more women in the intervention group with multiple pregnancies had two or three outcomes among their babies than in the placebo group. This produces the difference between the groups when analysed by pregnancy, because the number of women with one or more outcomes differs even though the total number of outcomes does not.

Analysing the Antenatal TRH trial data set 1000 times using the random selection method gave a mean odds ratio of 0.83, with a standard deviation of 0.04. This is close to the result using the cluster trial methods (0.74). The theoretical extreme values (0.67 and 0.97) were approached closely; the highest and lowest values observed in the 1000 iterations were 0.72 and 0.97. With a larger data set, it is much less likely that values close to the extreme will occur. This is seen in the results for Trials A and B, where the highest and lowest values were some distance from the theoretical extremes. For Trial A, where the outcome is more common in the intervention group, the results were approximately centred on the result from the cluster analysis, having a mean odds ratio of 1.88. In this case, the spread of results around the mean (standard deviation of 0.12) would be unlikely to affect the conclusions of the trial, as all or almost all possible results would show a significantly higher incidence in the intervention group. However, different results may affect meta-analyses, by distorting the overall estimate of the treatment effect, or by introducing heterogeneity.

The results for the random selection method applied to Trial B were different from those for Trial A. The mean odds ratio of the 1000 iterations was 0.80 (standard deviation 0.052), which was much closer to the result from analysis by pregnancy (OR 0.79) than to the result from the cluster method (OR 1.02). The range of odds ratios was 0.63–0.99. Results at the lower end of this range may show a statistically significant reduction in the outcome in the intervention group. It is not possible to specify a cutoff value below which the odds ratio would be significantly different from 1, because this depends in part on the absolute incidence of outcomes. However, with a control group event rate of 20% (the average result for the placebo group) an odds ratio of about 0.71 or less would be statistically significant at the 5% level. About 5% of the results from random selection had an odds ratio less than this value, suggesting that for this data set, about 5% of analyses by random selection would give a significant difference between the groups. This contrasts with the result from the cluster trial methods, which shows no suggestion of a difference between the groups (OR 1.02, 95% CI 0.78, 1.33).


Discussion

The results of applying the analytical methods to the data sets considered here show that different methods of analysis can sometimes lead to different results, and care is needed if misleading results are to be avoided.

The choice of analytical method for the type of data considered here should be determined primarily by the question being addressed. Analysing outcomes by pregnancy addresses a different question from the other methods, and may be appropriate in some cases. For example, in a trial evaluating whether it is advantageous to reduce triplet or higher order pregnancies to twins,10 women's decisions may be influenced by their chances of having any sick or disabled children. Women's risk of one or more bad outcomes would then be of interest, and it would be reasonable to analyse by pregnancy, rather than by baby. Also, outcomes that primarily affect the mother, and would usually be the same for all babies in one pregnancy, are likely to be more appropriately analysed by pregnancy rather than by baby. An example is mode of delivery: the effects of a caesarean section on a woman with a triplet pregnancy are likely to be similar whether one or three babies are delivered by this method. Similarly, if three triplets are all delivered by caesarean, counting this as three outcomes, one for each baby, would clearly be misleading as only one caesarean was performed.

The analysis of Trial B showed that the results for random selection and assuming independence between babies can differ quite markedly. This is of concern because random selection may be used in the belief that it is better because it avoids including non-independent data. However, in some circumstances, it can give a very different answer from other methods. This is influenced by whether the risks of outcomes for single and multiple pregnancies differ between the groups. In Trial B, babies from both single and multiple pregnancies in the control group had a risk of 0.20, but in the intervention group, the risks were 0.11 for single babies and 0.31 for multiples. The different distribution of risks between the groups contributes to the difference in the results between random selection and the other methods.

A further problem of random selection is in the presentation of the results. Usually, only one of the many possible results can be presented in a publication, and readers will not know whether the result is an unusual extreme value or what the range of possible results is. This adds uncertainty to the results and makes them more difficult to interpret. Moreover, the trial is open to the criticism that the analysis has been repeated many times, producing a range of results, and the preferred result presented. It is difficult to rebut this criticism, and hence from the investigators' point of view, it may be preferable to use a method of analysis that incorporates all of the data.

Using cluster trial methods to take account of non-independence may make little difference to the trial's conclusions in many cases, but sometimes the difference may be substantial. Allowing for non-independence in the analysis is likely to become more important as the proportion of multiple pregnancies in the trial increases or if the outcomes of babies from multiple pregnancies are more closely correlated. The effective sample size may be reduced more in a trial containing a relatively small proportion of highly correlated multiple pregnancies, than in a trial containing more multiples whose outcomes are close to being independent. The correlation between babies from multiple pregnancies in the Antenatal TRH trial data was high (an intracluster correlation coefficient of 0.635), but this derives from a small sample size, and further work is necessary to determine whether values of this magnitude are typical.
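The trade-off described here can be quantified with the usual design effect formula, 1 + (m − 1)ρ, for clusters of size m. A minimal sketch, using the intracluster correlation coefficient of 0.635 reported above:

```python
def design_effect(m, rho):
    """Variance inflation for clusters of size m with intracluster correlation rho."""
    return 1 + (m - 1) * rho

rho = 0.635  # ICC estimated from the Antenatal TRH trial data
for m, label in [(2, "twins"), (3, "triplets")]:
    de = design_effect(m, rho)
    # Effective sample size contributed per pregnancy = m / design effect.
    print(f"{label}: design effect {de:.3f}, effective n {m / de:.2f}")
```

With this ICC a twin pregnancy contributes only about 1.2 independent data points rather than 2, which is why a small number of highly correlated multiples can shrink the effective sample size noticeably.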

Our illustrations have been limited to dichotomous outcomes, but continuous outcomes can also be analysed by cluster trial methods in a similar way. Details are given by Donner and Klar.8

Other methods apart from those discussed here could also be applied to the analysis of trials including multiple pregnancies. These include more statistically complex methods, such as multilevel modelling and generalised estimating equations.8,11 These are generally more difficult for many researchers to apply and for readers to understand, but may be most useful in situations where it is necessary to take account of covariates in the analysis, for example, in observational studies or in small randomised controlled trials where there are chance differences in baseline characteristics between the groups.

Most trial results are combined with others in systematic reviews and meta-analyses to give an overview of the evidence. Different methods of analysis of trials including multiple pregnancies present a problem, as the results presented in a publication may be influenced by the analytical method used. The differences between methods may not be large, but combining several such studies in a meta-analysis has the potential to compound the errors and to give misleading answers. Methods for meta-analysis of cluster randomised trials have been developed and could be applied to data from trials incorporating multiple pregnancies.12 However, they require information on all babies of all randomised women, which may not be reported if trials are analysed by pregnancy or by random selection. At a minimum, systematic reviews should endeavour to ensure consistency of analytical methods among the trials they include; again, this may be difficult if the outcomes for some of the data are not reported or if the analytical method is not specified. The extent to which the conclusions of systematic reviews may be affected by different analytical methods is currently unknown.


Conclusions

Some of the methods that have been used for analysis of randomised trials including multiple pregnancies appear to have disadvantages. Moreover, not reporting outcomes for some of the babies in the trial may have implications for subsequent systematic reviews and meta-analyses. A reasonable general recommendation for outcomes that are most appropriately analysed by baby rather than by pregnancy would be to calculate the odds ratio or relative risk assuming independence among babies and to adjust the confidence intervals using one of the cluster trial methods. For future meta-analyses, it is also important to report the number of women with multiple pregnancies and the number of outcomes among women with single, twin, triplet and higher order pregnancies.


The authors thank Allan Donner for advice about the application of cluster trial methodology to perinatal trials during the development of this work.

Accepted 27 October 2003


Appendix 1

This appendix gives the formulae necessary for performing an analysis using the cluster trial methods. The methods are described fully by Donner and Klar,8 and the notation used here follows their notation. The methods are also implemented in the ACLUSTER statistical program,9 which is distributed by Metaxis, Summertown Pavilion, Middle Way, Oxford OX2 7LG, UK.


Notation

ki number of clusters (i.e. women) randomised to group i (where i= 1 or 2)

mij number of individuals in cluster j of group i (i.e. the number of babies of woman j in group i)

Mi total number of babies in group i

Yi total number of outcomes in group i

Pi event rate over all clusters in group i ( =Yi/Mi)

M total number of babies in the study

K total number of women in the study

Calculation of odds ratio

The odds ratio (OR) is calculated as:

OR = [P1/(1 − P1)]/[P2/(1 − P2)]
Calculation of confidence interval

To calculate a 95% confidence interval, the natural logarithm of the odds ratio (ln(OR)) and its variance must first be calculated:

var[ln(OR)] = C1[1/Y1 + 1/(M1 − Y1)] + C2[1/Y2 + 1/(M2 − Y2)]

where C1 and C2 are variance inflation factors that account for the clustering in each group:

Ci = Σj mij[1 + (mij − 1)ρ]/Mi
This requires calculation of the intracluster correlation coefficient ρ, estimated by the one-way analysis of variance estimator:

ρ = (MSC − MSW)/[MSC + (mA − 1)MSW]

where, writing yij for the number of outcomes in cluster j of group i,

MSC = [Σi Σj yij²/mij − Σi Yi²/Mi]/(K − 2)

MSW = [Σi Σj yij − Σi Σj yij²/mij]/(M − K)

mA = [M − Σi (Σj mij²)/Mi]/(K − 2)
The confidence limits are calculated as:

exp[ln(OR) ± Zα/2 √var[ln(OR)]]
For 95% confidence limits, the value of Zα/2 is 1.96.
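The appendix calculations can be sketched in code as follows. This is a sketch of the method as outlined above, not the authors' own implementation; in particular, the ANOVA estimator of ρ follows my reading of the standard Donner and Klar approach and should be checked against their book before use.

```python
import math

def adjusted_odds_ratio(group1, group2, z=1.96):
    """Odds ratio with a clustering-adjusted confidence interval.

    Each group is a list of (m_ij, y_ij) pairs: the number of babies and
    the number of outcomes for one woman. Follows the appendix notation.
    """
    groups = [group1, group2]
    M = [sum(m for m, _ in g) for g in groups]   # babies per group
    Y = [sum(y for _, y in g) for g in groups]   # outcomes per group
    P = [Y[i] / M[i] for i in range(2)]          # event rates
    K = sum(len(g) for g in groups)              # total women

    if sum(M) == K:
        rho = 0.0  # all singletons: no clustering to adjust for
    else:
        # One-way ANOVA estimator of the intracluster correlation rho.
        sum_y2m = sum(y * y / m for g in groups for m, y in g)
        msc = (sum_y2m - sum(Y[i] ** 2 / M[i] for i in range(2))) / (K - 2)
        msw = (sum(Y) - sum_y2m) / (sum(M) - K)
        m_a = (sum(M) - sum(sum(m * m for m, _ in g) / M[i]
                            for i, g in enumerate(groups))) / (K - 2)
        rho = (msc - msw) / (msc + (m_a - 1) * msw)

    # Variance inflation factor C_i for each group.
    C = [sum(m * (1 + (m - 1) * rho) for m, _ in g) / M[i]
         for i, g in enumerate(groups)]

    odds_ratio = (P[0] / (1 - P[0])) / (P[1] / (1 - P[1]))
    var_ln_or = sum(C[i] * (1 / Y[i] + 1 / (M[i] - Y[i])) for i in range(2))
    half_width = z * math.sqrt(var_ln_or)
    ln_or = math.log(odds_ratio)
    return odds_ratio, (math.exp(ln_or - half_width),
                        math.exp(ln_or + half_width))
```

With only singleton pregnancies this reduces to the standard unadjusted odds ratio and logit confidence interval; with correlated multiples the point estimate is unchanged but the interval widens, mirroring the behaviour of the cluster trial methods in Table 2.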