Response to: ‘Power dressing and meta-analysis: incorporating power analysis into meta-analysis’ by S. Muncer (2002) Journal of Advanced Nursing 38, 274–280

I am very concerned that a paper with such potential for misleading your readers should have been published: it seems to me that this paper, which contains several significant errors, could do a great deal of harm.

The general message of the paper, if I understand it correctly, is that meta-analysis suffers from a serious publication bias, in that the studies which result in positive findings are far more likely to be published than those with negative findings. This is undeniably true, and is a well-known major flaw of meta-analysis in general. The authors suggest that it is the smaller studies which are most likely to have the negative findings, because unless an effect or difference is really gross a small study is unlikely to find it. They suggest the remedy of including in a meta-analysis only those studies which are reasonably large, thereby excluding all the small and under-powered studies, whether they found a positive effect or not. They cite a paper by Kraemer et al. (1998) which makes precisely this point.

The authors have a different approach, however. They propose combining all the studies in the meta-analysis to find a weighted mean effect size, and examining all the included studies to see whether they would be large enough to have a reasonable power to detect that effect size. Those that would are retained; those that would not are removed. The mean effect size is then recalculated on the basis of those studies that remain in the meta-analysis, and the exercise is repeated. They would carry on with this process iteratively until all the remaining studies are retained.

The flaw of this process is that if, as the authors suggest, most of the larger effects are found in small studies, these would be eliminated first, the larger studies with smaller apparent effect sizes could be eliminated next, and they could finish up eliminating all the studies and hence having no scope for a meta-analysis at all. In other words, their algorithm could over-correct for the problem, and prevent many meta-analyses from taking place. (Some may think that would be a good thing, but others would be horrified at such a step.)
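
The iterative procedure, and the way it can over-correct, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the power formula is a normal approximation for a two-sample comparison, the 0·8 power threshold and the four toy studies are invented for illustration, and they are chosen so that small studies carry large apparent effects, as the authors themselves suggest.

```python
from statistics import NormalDist

Z = NormalDist()

def power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sample comparison to detect a
    standardized effect size d with n_per_group subjects per arm
    (normal approximation, two-sided alpha)."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return 1 - Z.cdf(z_crit - abs(d) * (n_per_group / 2) ** 0.5)

def iterative_trim(studies, threshold=0.8):
    """studies: list of (effect_size, n_per_group) pairs.
    Repeatedly drop studies under-powered to detect the current
    weighted mean effect, then recalculate, as the paper proposes."""
    kept = list(studies)
    while kept:
        total_n = sum(n for _, n in kept)
        mean_d = sum(d * n for d, n in kept) / total_n
        retained = [(d, n) for d, n in kept if power(mean_d, n) >= threshold]
        if len(retained) == len(kept):
            return kept, mean_d          # stable: every study retained
        kept = retained
    return kept, None                    # everything eliminated

# Invented studies: small ones report big effects, large ones small effects.
studies = [(0.9, 10), (0.7, 15), (0.2, 120), (0.1, 200)]
kept, mean_d = iterative_trim(studies)   # here the loop empties the list
```

With these numbers no study has 80% power to detect the weighted mean effect of roughly 0·18, so the algorithm discards everything and leaves no meta-analysis at all, which is exactly the over-correction described above.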

However, if the authors' message is as I have suggested above, it is at least an argument worth making, and worthy of examination. What detracts from the argument is the incidental material in the paper, which is either irrelevant or, as I shall attempt to show, just plain wrong.

The paper gets off to a bad start in the abstract, where power is defined as ‘the probability that [the study] will lead to a statistically significant result’ – this is only half the truth. The power is the probability that it will lead to a statistically significant result, on the assumption that the alternative hypothesis is true – and therefore the power will vary in value according to what the alternative hypothesis is. This alternative is normally expressed as a specific effect size or specific degree of relationship. The paper makes no comment on this at all. Also in the abstract, we are told that ‘a simple means of calculating an easily understood measure of effect size from a contingency table is demonstrated’ in the paper; but in fact, all we are shown is a very elementary way of converting a chi-squared value from a 2 × 2 contingency table into a correlation coefficient, and this is not the same thing as an effect size at all.
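
The point that power is meaningless without an alternative hypothesis can be made concrete. In this illustrative sketch (using a normal approximation for a two-sample comparison; the group size of 50 and the effect sizes are invented), the same study has three quite different powers depending on the alternative assumed:

```python
from statistics import NormalDist

Z = NormalDist()

def power(d, n_per_group, alpha=0.05):
    """Power of a two-sample comparison to detect a standardized
    effect d at two-sided alpha (normal approximation)."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return 1 - Z.cdf(z_crit - abs(d) * (n_per_group / 2) ** 0.5)

# With 50 subjects per group, "the" power is not a single number:
# it depends entirely on the alternative hypothesis assumed.
small = power(0.2, 50)    # ~0.17 against a small effect
medium = power(0.5, 50)   # ~0.70 against a medium effect
large = power(0.8, 50)    # ~0.98 against a large effect
```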

The bad start continues when in the left column of p. 275 the paper defines an odds ratio as the ratio of the probability of occurrence to the probability of non-occurrence. This is the definition of odds; an odds ratio is the ratio of two odds, one for one set of conditions divided by that for another. So, for example, if with treatment A one half of people recover from a disease, and with treatment B one-fifth of people recover, the odds for A are (1/2)/(1/2) = 1, and for B (1/5)/(4/5) = 1/4. So the odds ratio for treatment A against treatment B is 1/(1/4) = 4. To commit such an elementary error makes one wonder about the authors' expertise in statistical terms.
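
The distinction is easy to state in code. The following sketch simply reproduces the recovery example above (the probabilities one half and one fifth come from that example; the function names are mine):

```python
def odds(p):
    """Odds corresponding to a probability p of occurrence."""
    return p / (1 - p)

def odds_ratio(p_a, p_b):
    """Ratio of the odds under condition A to the odds under B."""
    return odds(p_a) / odds(p_b)

# The recovery example: one half recover on A, one fifth on B.
odds_a = odds(1 / 2)              # (1/2)/(1/2) = 1.0
odds_b = odds(1 / 5)              # (1/5)/(4/5) = 0.25
ratio = odds_ratio(1 / 2, 1 / 5)  # 1 / 0.25 = 4.0
```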

In the same paragraph, the authors say that it is possible to convert a chi-squared value into a correlation coefficient r, but this is only true in the case of a 2 × 2 table such as that shown on p. 278 (where, incidentally, and to give credit where it is due, they have carried out the conversion correctly). We are also told that it is possible to convert a correlation coefficient into an effect size d; but we are not told how this can be done. Indeed, in their later discussion at the top of p. 279, the authors seem to treat a correlation coefficient as if it were an effect size. A correlation coefficient of 0·1 is not the same thing as a standardized effect size of 0·1; and a correlation coefficient of 0·1 does not mean a 10% improvement. The reference to an effect size r in that paragraph makes me wonder if the authors got themselves hopelessly confused; the reader certainly would be.
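
For readers left wondering what the paper omits, here is a sketch of both conversions. The chi-squared value of 4·0 and sample size of 100 are invented numbers purely for illustration; the second function uses one standard conversion from a correlation to a standardized effect size d (d = 2r/√(1 − r²)), which the paper never supplies:

```python
from math import sqrt

def phi_from_chi2(chi2, n):
    """Correlation (phi coefficient) from a chi-squared value.
    Valid ONLY for a 2 x 2 contingency table with n observations."""
    return sqrt(chi2 / n)

def d_from_r(r):
    """One standard conversion from a correlation r to a
    standardized effect size d: d = 2r / sqrt(1 - r**2)."""
    return 2 * r / sqrt(1 - r ** 2)

r = phi_from_chi2(4.0, 100)   # invented values: r = sqrt(0.04) = 0.2
d = d_from_r(0.1)             # ~0.201, so r = 0.1 is not d = 0.1
```

The last line makes the letter's point directly: a correlation of 0·1 corresponds to a d of about 0·2, not 0·1, so the two scales cannot be used interchangeably.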

At the end of the same paragraph on p. 275, we have a statement that ‘it is difficult to interpret confidence intervals for odds ratios’, simply because of the lack of symmetry of those confidence intervals (CIs). But if I know that the 95% CI for the odds ratio of treatment A against treatment B extends from, say, 0·77 to 2·13, then I know immediately that it is perfectly easy to believe that the two treatments do not differ in effect (because the interval includes 1). There is, however, some evidence that A is better than B: the lower limit is greater than 1/2·13, so on its logarithmic scale the CI extends further above 1 than it does below it. Moreover, a true odds ratio of 2 in favour of A would be quite easy to believe, but a true odds ratio of 3 in favour of A would be much less believable (because 2 lies within, but 3 outside, that CI). What is difficult about that?
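
Each of these readings of the interval is a one-line check. The sketch below uses the illustrative limits 0·77 and 2·13 from the example above; the variable names are mine:

```python
from math import exp, log

lower, upper = 0.77, 2.13                    # the example 95% CI
includes_one = lower < 1 < upper             # True: "no difference" is plausible
centre = exp((log(lower) + log(upper)) / 2)  # ~1.28: on the log scale the
                                             # interval is tilted in favour of A
plausible_2 = lower <= 2 <= upper            # True: an odds ratio of 2 is believable
plausible_3 = lower <= 3 <= upper            # False: an odds ratio of 3 is not
```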

Again, on p. 276, we are treated to an example which ‘may help to explain how sample size and effect size influence the power of an experiment’. Well, the example they use doesn't do anything of the sort. In fact, it confuses the issue totally. The null hypothesis in this example is that the balls are all of the same colour, and the alternative hypothesis presumably is that they are not all the same. (It is striking that the entire paper does not use the phrase ‘alternative hypothesis’ once, and yet the concept of power is meaningless unless you have an alternative hypothesis and know what it is.) The first test the authors propose is to take two balls randomly from the bag; if they are the same colour accept the null hypothesis, otherwise reject it in favour of the alternative. In this test, the significance level is the probability that the balls chosen are of different colours if the null hypothesis is true, and that is clearly zero (if the balls were all the same colour, the two selected cannot be different). The power is the probability that the balls are of different colours if the alternative is true. If that alternative hypothesis is, as they suggest, that seven are black and three white, then the power is 42/90. But if we take their second alternative hypothesis, namely five black and five white, then the power becomes 50/90, which is higher than in the first case! (This is hardly surprising, as 7 black and 3 white is closer to all the balls being the same colour than 5 black and 5 white, so clearly the latter condition would be easier to detect.) So the example seems to go against what the authors claim to be trying to show.
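
The two power values are simple to verify by exact counting. This sketch computes the probability that two balls drawn without replacement differ in colour, under the null (all ten the same colour) and under each of the two alternatives mentioned above:

```python
from fractions import Fraction

def p_two_differ(black, white):
    """Probability that two balls drawn without replacement from a
    bag of black + white balls turn out to be different colours."""
    total = black + white
    return Fraction(2 * black * white, total * (total - 1))

p_null = p_two_differ(10, 0)   # 0: the significance level under H0
p_73 = p_two_differ(7, 3)      # 42/90: power against 7 black, 3 white
p_55 = p_two_differ(5, 5)      # 50/90: power against 5 black, 5 white
assert p_55 > p_73             # the supposedly "smaller" 5/5 effect
                               # actually gives MORE power
```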

In this example, what the authors mean when they refer to ‘a smaller effect of colour on marbles in the bag’ I cannot imagine. The colour of the balls is the observed variable, not an independent variable, and so cannot have an effect on anything! It would be analogous to speak of blood pressure having an effect on patients, in a trial where in reality the purpose is to determine whether the treatment is having an effect on blood pressure.

On p. 278, we are told that from Table 1 it is ‘clear’ that the effect sizes between the various studies reported there are quite similar, because six of the seven are in the same direction and the mean and median effect sizes are similar. Well, the similarity of the mean and median only tells us that the variation in the values is roughly symmetrical: the variation could be huge, however, and still give very similar mean and median. The mean and median of the numbers 2, 2·2, 2·4, 2·6, 2·8 and 3 are the same; so are the mean and median of the numbers −100, 0, 100, 200, 300 and 400; the first set are very similar, the second set very different. So what the authors say here is nonsense.
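
The two six-number sets above demonstrate this directly; a quick check with the standard deviation makes the contrast explicit:

```python
from statistics import mean, median, pstdev

tight = [2, 2.2, 2.4, 2.6, 2.8, 3]
spread = [-100, 0, 100, 200, 300, 400]

# Mean equals median in both sets (2.5 and 150 respectively)...
assert abs(mean(tight) - median(tight)) < 1e-9
assert abs(mean(spread) - median(spread)) < 1e-9

# ...yet the variation differs enormously:
sd_tight = pstdev(tight)    # ~0.34
sd_spread = pstdev(spread)  # ~170.8
```

Agreement of mean and median says something about symmetry, and nothing whatever about similarity of the effect sizes.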

Finally, on p. 279, the authors wish to ‘clarify their position’ regarding use of CIs in meta-analysis. They then proceed to criticize the use of CIs, as they take no account of the power of the study. But this statement shows particularly muddled thinking, because the concept of power refers to the probability of rejecting the null hypothesis when a specified alternative is true, and therefore only applies in a hypothesis-testing situation. Where CIs are used, there is no question of hypothesis testing; CIs are usually an alternative (most statisticians think, a better alternative) to hypothesis testing. So when you use a CI, the question of power cannot normally arise. Therefore, the statement that ‘there is no way of knowing the power of a study from calculating a confidence interval’ is superfluous: in fact, CIs are more informative because they automatically convey a measure of the precision of estimate, in the form of their width. The wider the CI, the less precise the estimate, and therefore the greater the likelihood of important differences or effects being missed. Are the authors using some definition of power different from that which statisticians have been using for the past 70 years or more?
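
How the width of a CI conveys precision is a one-line calculation. In this illustrative sketch (a standard deviation of 10 and the two sample sizes are invented; the formula is the usual normal-theory interval for a mean):

```python
from math import sqrt

def ci_width(sd, n, z=1.96):
    """Width of an approximate 95% CI for a mean: the interval's
    width itself conveys the precision of the estimate."""
    return 2 * z * sd / sqrt(n)

wide = ci_width(10, 20)     # ~8.8: imprecise, real effects easily missed
narrow = ci_width(10, 500)  # ~1.8: precise
```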

In short, this paper loses any credibility which its central message may have by its muddled and improper use of statistical terms and ideas. My main concern is that nurses, who (as I know from my teaching experience) tend to struggle with statistics, will read this article and finish up so confused that their self-confidence with quantitative approaches will actually be reduced. This means that the paper could do a great deal of harm.

I am sorry to be so negative about this paper. I am glad to see that JAN is publishing papers on statistical aspects of research, which are very important and from which many aspiring nurse researchers shy away. But this paper will only make matters worse, and that is a great pity. The paper represents a valuable opportunity lost.