Summary
1. Meta-analysis is a powerful and informative tool for basic and applied research. It provides a statistical framework for synthesizing and comparing the results of studies that have all tested a particular hypothesis. Meta-analysis has the potential to be particularly useful for ecologists and evolutionary biologists, as individual experiments often rely on small sample sizes due to the constraints of time and manpower, and therefore have low statistical power.
2. The rewards of conducting a meta-analysis can be significant. It can be the basis of a systematic review of a topic that provides a powerful exploration of key hypotheses or theoretical assumptions, thereby influencing the future development of a field of research. Alternatively, for the applied scientist, it can provide robust answers to questions of ecological, medical or economic significance. However, planning and conducting a meta-analysis can be a daunting prospect and the analysis itself is invariably demanding and labour intensive. Errors or omissions made at the planning stage can create weeks of extra work.
3. While a range of useful resources is available to help the budding meta-analyst on his or her way, much of the key information and explanation is spread across different articles and textbooks. In order to help the reader use the available information as efficiently as possible (and so avoid making time-consuming errors) this article aims to provide a ‘road map’ to the existing literature. It provides a brief guide to planning, organizing and implementing a meta-analysis which focuses more on logic and implementation than on maths; it is intended to be a first port of call for those interested in the topic and should be used in conjunction with the more detailed books and articles referenced. In the main, references are cited and discussed with an emphasis on useful reading order rather than a chronological history of meta-analysis and its uses.
4. No prior knowledge of meta-analysis is assumed in the current article, though it is assumed that the reader is familiar with ANOVA and regression-type statistical models.
Introduction: the foundations of meta-analysis
A literature review for any given topic is likely to turn up a long list of studies, with varying degrees of consistency in experimental methodology, study species and analytical approach. Often these studies have led to very different conclusions. For example, theoreticians working on the evolution of biparental care have predicted that it is only an evolutionarily stable strategy if individuals respond to a decrease in parental care effort by their mate with an increase of smaller magnitude in their own care effort. Over the last 25 years, many behavioural ecologists have performed experiments to test whether partial compensation is indeed observed if one member of a breeding pair is removed or handicapped to reduce its care input. These studies have been carried out on birds, rodents and insects and have reported every possible response to experimental manipulation, from desertion to overcompensation for the lost care effort (reviewed in Harrison et al. 2009). Given the variability in how these studies were conducted and the often small individual sample sizes, it is almost impossible to decide if the literature as a whole supports the partial compensation hypothesis simply by reading and contrasting studies. However, meta-analysis provides a formal statistical framework with which we can rigorously combine and compare the results of these experiments. In this article, I will outline the logic of meta-analysis and provide a brief guide to planning, organizing and implementing a meta-analysis. The article is intended to serve as a ‘road map’ to the numerous detailed resources which are available, providing an introduction which focuses more on logic and implementation than on mathematics. A glossary of key terms (marked in bold in the main text) is provided in Box 1 and key references are listed at the end of each section.
Meta-analysis gives us quantitative tools to do two things. First, if a number of attempts have been made to measure the effect of one variable on another, then meta-analysis provides a method to calculate the mean effect of the independent variable, across all attempts. Usually, the independent variable represents some form of experimental manipulation (treated vs. control groups, or a continuous variable representing treatment level). To illustrate this, Fernandez-Duque & Valeggia (1994) combined the results of five studies of the effect of selective logging on bird populations. This revealed a detrimental effect of selective logging on population density that was not immediately apparent from simply looking at the results of the individual studies. Secondly, meta-analysis allows us to measure the amount of experimentally induced change in the dependent variable across studies and to attempt to explain this variability using defined moderator variables. Such variables could reflect phylogenetic, ecological or methodological differences between study groups. For example, in a classic meta-analysis of 20 studies, Côté & Sutherland (1997) calculated that, on average, predator removal resulted in an increase in post-breeding bird populations but not in breeding populations.
Meta-analysis achieves these goals by using effect sizes: these are statistics that provide a standardized, directional measure of the mean change in the dependent variable in each study. Effect sizes can incorporate considerations of sample size. Furthermore, when being combined in a meta-analysis, effect sizes can be weighted by the inverse of the variance of the estimate, such that studies with lower variance (i.e. tighter estimated effect size) are given more weight in the data set. Because variance decreases as sample size increases, this generally means that effect sizes based on larger study populations are given greater weight. Tests which are analogous to analysis of variance (ANOVA) and weighted regression can then be applied to the population of effect sizes to identify moderator variables that explain a significant amount of variation between studies. For instance, in our meta-analysis, we found that the mean response to partner removal or handicapping was indeed partial compensation and that the sex of the responding partner and aspects of the experimental methodology explained some of the variation between individual studies (Harrison et al. 2009).
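The inverse-variance weighting described above can be sketched in a few lines of Python. This is a minimal illustration of a fixed-effect pooled mean, not a substitute for a dedicated meta-analysis package; the three studies and their sampling variances are invented for the example.

```python
import math

def weighted_mean_effect(effects, variances):
    """Inverse-variance weighted (fixed-effect) mean effect size.

    Each study is weighted by 1/variance, so tighter estimates
    (usually those from larger samples) count for more.
    """
    weights = [1.0 / v for v in variances]
    mean = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))          # SE of the pooled mean
    ci = (mean - 1.96 * se, mean + 1.96 * se)   # approximate 95% CI
    return mean, se, ci

# Three hypothetical studies: effect sizes with their sampling variances.
mean, se, ci = weighted_mean_effect([0.30, 0.45, 0.10], [0.04, 0.09, 0.02])
```

Note how the third study, with the smallest variance, pulls the pooled mean towards its own small effect despite the larger effects reported elsewhere.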
Meta-analysis vs. vote counting
How many of us have heard the results of null hypothesis significance tests (NHST) being referred to as showing ‘strong’ or ‘weak’ effects of some variable, based on the size of the calculated P-value? It is a common fallacy to assume that the smaller the P-value, the stronger the observed relationship must be. However, the magnitude of an effect and its statistical significance are not intrinsically correlated: a small P-value does not necessarily mean that the effect of experimental treatment is large, or that the slope of a variable of interest on some covariate is steep. This is due in large part to the dependence of P on sample size: given a large enough sample size, the null hypothesis will almost always be rejected. P-values reflect a dichotomous question (is the observed pattern of data likely to be due to chance, or not?) not an open-ended one (how strong is the pattern in the data?). Cohen (1990) uses a rather wonderful example to demonstrate this point: he cites a study of 14 000 children that reported a significant link between height (measured in feet and adjusted for age and sex) and IQ. He then points out that if we take ‘significant’ to mean a P-value of 0·001, then a correlation coefficient of at least 0·0278 – a very shallow slope indeed – would be found to be significant in a sample this large. The authors actually reported a rather larger correlation coefficient of 0·11, but the effect of height on IQ is still small: converting height to a metric measure, this means that a 30-point increase in IQ would be associated with an increase in height of over a metre.
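Cohen's point can be reproduced with a short calculation: using Fisher's z approximation, the smallest correlation that reaches a given two-tailed significance level shrinks rapidly with sample size. A sketch (3·2905 is the standard normal quantile for two-tailed P = 0·001):

```python
import math

def critical_r(n, z_alpha=3.2905):
    """Smallest |r| significant at the given two-tailed level.

    Uses Fisher's z approximation: a sample correlation r is
    significant when |atanh(r)| * sqrt(n - 3) exceeds the normal
    quantile z_alpha (3.2905 corresponds to two-tailed P = 0.001).
    """
    return math.tanh(z_alpha / math.sqrt(n - 3))

# With 14 000 children, a correlation of about 0.028 is already 'significant'.
r_crit = critical_r(14000)
```

The same function shows why small studies cannot detect weak effects: with n = 100 the critical correlation is more than ten times larger.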
The pros and cons of NHST and its alternatives have been discussed by other authors (Nakagawa & Cuthill 2007; Stephens, Buskirk & del Rio 2007) and are beyond the scope of this article: suffice it to say that P-values from NHST do not measure the magnitude of the effect of independent variables on dependent variables, are heavily influenced by sample size and are not generally comparable across studies. In other words, P-values are not effect sizes: two studies can have the same effect size but different P-values, or vice versa.
This means that post hoc analyses that rely on ‘vote counting’ of studies with significant and non-significant results are not very reliable. Vote counting has been a common method of determining support for a hypothesis, is often used in the introduction or discussion sections of empirical papers to provide an overview of the current state of a field or to justify new work, and is sometimes published under the erroneous title of meta-analysis. No quantitative estimate of the effect of interest is provided by vote counting. Furthermore, vote counting lacks statistical power for two reasons. First, the effect of sample size on P-value means that real but small effects may have been obscured by small sample size in the original studies. Secondly, simply counting votes with no attention to effect magnitude or sample size does nothing to rectify this lack of power. A formal meta-analysis ameliorates this problem. Not only are effect sizes more informative, they also represent continuous variables that can be combined and compared. A more subtle point is that NHST focuses on reducing the probability of type I errors (rejecting the null hypothesis when it is in fact true). Type II errors (failing to reject the null hypothesis when it is false) are not so tightly controlled for and this type of error can be of particular concern in fields such as conservation or medicine, where failing to detect an effect of, say, pesticide use on farm bird populations, could be more harmful than a type I error. By definition, any method that increases the power of a test reduces the likelihood of making a type II error.
There are three commonly-used types of statistic that give reliable and comparable effect sizes for use in meta-analysis. All can be corrected for sample size and weighted by within-study variance. For studies that involve comparing a continuous response variable between control and experimental groups, the mean difference between the groups can be calculated. For studies that test for an effect of a continuous or ordinal categorical variable on a continuous response variable, the correlation coefficient can be used. Finally, for dichotomous response variables the risk ratio or odds ratio provides a measure of effect size. Once a population of effect sizes has been collected, it is possible to calculate the mean effect size and also a measure of the amount of between-study variation (heterogeneity, Q) in effect size.
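The heterogeneity statistic Q mentioned above is simply a weighted sum of squared deviations of the individual effect sizes from the pooled mean. A minimal sketch, with invented study values:

```python
import math

def heterogeneity_Q(effects, variances):
    """Cochran's Q for between-study heterogeneity.

    Under the null hypothesis that all studies share one common
    effect, Q follows a chi-squared distribution with k - 1
    degrees of freedom (k = number of studies).
    """
    weights = [1.0 / v for v in variances]
    mean = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    Q = sum(w * (e - mean) ** 2 for w, e in zip(weights, effects))
    return Q, len(effects) - 1

Q, df = heterogeneity_Q([0.30, 0.45, 0.10], [0.04, 0.09, 0.02])
```

A Q value well above its degrees of freedom suggests that the studies do not share a single underlying effect, motivating the search for moderator variables.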
We might therefore say that meta-analysis is more clearly needs-driven and evidence-based than simple vote counting. Box 2 provides a simple demonstration of how meta-analysis works. It should be noted that, like any statistical method, meta-analyses are only as good as the data used and can still suffer from both type I and type II errors: this is dealt with in more detail in the discussion of meta-analytic methods below.
A ‘to do’ list for meta-analysis
At this point, it would be useful to outline the steps required to begin and carry out a meta-analysis.
1. Perform a thorough literature search for studies that address the hypothesis of interest, using defined keywords and search methodology. This includes searching for unpublished studies, for example by posting requests to professional newsletters or mailing lists.
2. Critically appraise the resulting studies and assess whether they should be included in the review. (Are they applicable? Is the study methodology valid? Do you have enough information to calculate an effect size?) Record the reasons for dropping any studies from your data set.
3. Choose an appropriate measure of effect size and calculate an effect size for each study that you wish to retain.
4. Enter these studies into a master database which includes study identity, effect size(s), sample size(s) and information which codes each study for variables which you have reason to believe may affect the outcome of each study, or whose possible influence on effect size you wish to investigate (experimental design, taxonomic information on the study species, geographic location of study population, life-history variables of the species used etc.). You should also record how you calculated the effect size(s) for each study (see below).
5. Use meta-analytic methods to summarize the cross-study support for the hypothesis of interest and to try to explain any variation in conclusions drawn by individual studies.
6. Assess the robustness and power of your analysis (likelihood of type I and type II errors).
Steps 1 and 2 reflect the fact that meta-analysis sits within the general methodological framework of the systematic review. Cooper, Hedges & Valentine (2009) argue that research synthesis based on systematic reviews can be viewed as a scientific discipline in its own right. As they rightly stress, a good systematic review follows exactly the same steps as an experiment: a problem is identified, a hypothesis or hypotheses formulated, a method for testing the hypothesis designed and, once applied, the results of this method are quantitatively analysed. The method itself can then be criticized. These steps allow the goals of systematic review in general, or meta-analysis in particular, to be met. It is difficult to argue that a review has usefully contributed to a field – whether it be by providing critical analysis of empirical results, highlighting key issues or addressing a conflict – if the review itself does not have a firm basis in a defined methodology for identifying, including and extracting information from the sources reviewed. A notable proponent of the systematic review approach in ecology, Stewart (e.g. Stewart, Coles & Pullin 2005; Pullin & Stewart 2006; Roberts, Stewart & Pullin 2006) has provided guidelines relevant to this field.
The keys to making meta-analysis as stress-free as possible are organization and planning. In particular, your list of potential moderator variables (step 4) should be clearly defined before you begin: it is far preferable to produce a database which includes information that you later decide not to use, than to produce a database that excludes a variable you later decide to explore, as the latter may require a second (or third, or fourth) trawl through your collection of studies to extract the necessary information. In the present article, I will now concentrate on the mechanics of carrying out a meta-analysis (steps 3, 5 and 6).
Choosing an appropriate effect size statistic
A meaningful measure of effect size will depend on the nature of the data being considered. Experimental and observational studies in ecology and evolution generally generate data that fall into one of three categories, and this determines which indices of effect size are appropriate. All of the indices of effect size outlined below have known sampling distributions (generally they are normalized) and this allows us to calculate their standard errors and construct confidence intervals.
1. Continuous or ordinal data from two or more groups. Data in this form are exemplified by treatment vs. control group comparisons and are generally presented and analysed using averages and measures of variance (mean and standard deviation, median and interquartile range, etc.). In such cases, a measure of the difference between the group means is an appropriate effect size. The raw difference in means can be standardized by the pooled standard deviation; two commonly-used measures of standardized mean difference are Cohen's d and Hedges' g: these differ in the method used for calculating the pooled standard deviation but it should be noted that the d and g notation has been used interchangeably by some authors. Alternatively, when the data measure rates of change in independent groups (e.g. plant growth response in normal or elevated CO₂, body mass gain after supplementary feeding), the response ratio can be used. This measures the ratio of the mean change in one group to the mean change in the other. Like the standardized mean difference, it takes the standard deviations in the two groups into account. The response ratio is generally log-transformed prior to meta-analysis in order to linearize and normalize the raw ratios.
2. Continuous or ordinal data which are a response to a continuous or ordinal independent variable. Any data which are analysed using correlation or regression fall into this category. In this case, the correlation coefficient itself can be used as a measure of effect size. Generally, we are interested in a simple bivariate relationship (say, the effect of average daily rainfall on the laying date of great tits), and it may be that the studies in our data set also explore such a relationship. If a study reports the results of statistical tests which include other variables (such as average daily temperature during the breeding season), then we might use the partial correlation coefficient: the effect of rainfall on lay date if temperature is held fixed. (It may be that this is the only effect size we can calculate from the data available; if the published data allow us to calculate the simple bivariate correlation between rainfall and laying date, ignoring temperature, then we could argue that it would be better to use this as our effect size as it would be more directly comparable with the bivariate correlation coefficients retrieved from the other studies.) Whichever type of correlation coefficient we use, Fisher's z transformation is generally applied in order to stabilize the variance among coefficients prior to meta-analysis.
3. Binary response data. Data that take the form of binary yes/no outcomes, such as nest success or survival to the next breeding season, are generally analysed using logistic regression or a chi-squared test. In this case, an appropriate measure of effect size is given by calculating the risk ratio or odds ratio. These types of effect size have rarely been used in ecology and evolution, though they are common in medical research.
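The three families of effect size described above all reduce to short formulas. The sketch below gives the standard textbook versions (the small-sample correction in Hedges' g follows the usual J factor); the numbers in the usage lines are invented for illustration.

```python
import math

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference between two independent groups."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2)
                          / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d multiplied by the small-sample bias correction J."""
    d = cohens_d(m1, sd1, n1, m2, sd2, n2)
    j = 1.0 - 3.0 / (4.0 * (n1 + n2 - 2) - 1.0)
    return j * d

def log_response_ratio(m1, m2):
    """ln(m1/m2); both group means must be positive."""
    return math.log(m1 / m2)

def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

def log_odds_ratio(a, b, c, d):
    """Log odds ratio from a 2x2 table [[a, b], [c, d]]."""
    return math.log((a * d) / (b * c))

d = cohens_d(10.0, 2.0, 20, 8.0, 2.0, 20)   # pooled SD is 2, so d = 1
g = hedges_g(10.0, 2.0, 20, 8.0, 2.0, 20)   # slightly shrunk towards 0
z = fisher_z(0.5)
lor = log_odds_ratio(20, 10, 10, 20)
```

Each of these statistics has a known sampling variance, which is what supplies the weights in the pooled analysis.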
In some cases, more than one type of effect size can be meaningful. For instance, if an experiment involves applying some quantitatively measurable level of treatment to the experimental group, then the experimental and control groups could meaningfully be compared using either standardized mean difference or a correlation coefficient. If different studies have applied different levels of the treatment to their experimental groups, the latter may be preferable. The ‘best’ measure of effect size must be judged based on its compatibility with the available raw data and its ease of interpretation.
Much of the labour in conducting a meta-analysis lies in calculating individual effect sizes for studies. All of the effect sizes mentioned above can be calculated from reported means, variances, SEs, correlation coefficients and frequencies. If these are not available then effect sizes can be calculated from reported t, F or chi-squared statistics or from P-values. The exact formulae for calculating effect sizes from these data differ depending on the nature of the statistical tests and experimental designs from which they were taken (e.g. paired vs. unpaired t-test). This is explained rather thoroughly by DeCoster (2004) and Nakagawa & Cuthill (2007). In general, the more directly you can calculate an effect size – the less you have to infer by using test statistics and reconstructed statistical tables – the less error will be incorporated into your estimate of the effect size. It is also possible to convert between different measures of effect size.
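When only test statistics are reported, standard conversions recover an effect size. The versions below are the common ones for unpaired designs; paired designs need different formulas, as the references above explain, and the equal-group-size assumption in the d-to-r conversion is flagged in the comment.

```python
import math

def d_from_t(t, n1, n2):
    """Cohen's d from an unpaired-samples t statistic."""
    return t * math.sqrt(1.0 / n1 + 1.0 / n2)

def r_from_t(t, df):
    """Correlation coefficient from a t statistic and its df."""
    return t / math.sqrt(t * t + df)

def r_from_d(d):
    """Convert d to r (assumes roughly equal group sizes)."""
    return d / math.sqrt(d * d + 4.0)

d = d_from_t(2.0, 20, 20)
r = r_from_t(2.0, 38)
```

These conversions introduce more uncertainty than working from raw means and variances, which is why direct calculation is preferred where the data allow it.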
While the actual mathematics of converting reported data into effect sizes is rendered fairly straightforward thanks to freely available Microsoft Excel files and meta-analytic software packages, actually harvesting the necessary data from a library of published studies can be painstaking work. The number of studies that have to be discarded due to an inability to calculate a meaningful effect size based on the information available can be surprisingly high. Studies that do not give variance statistics, do not clearly state which statistical tests were used or even do not make sample size explicit all create headaches for the would-be meta-analyst.
Key references The textbooks referenced above all outline various effect size calculations, as do Hillebrand (2008), DeCoster (2004) and Nakagawa & Cuthill (2007). Hedges, Gurevitch & Curtis (1999) provide an introduction to the use of response ratios for ecological data and Schielzeth (2010) presents a thoughtful and interesting perspective on the calculation and presentation of correlation coefficients. Lipsey & Wilson (2002) helpfully provide an Excel spreadsheet for calculating effect sizes, which complements the information provided in the Appendices of their textbook; a similar spreadsheet is provided by Thalheimer & Cook (2002). The software packages outlined in ‘Partitioning and explaining heterogeneity’ can also calculate effect sizes from summarized data.
Criticizing meta-analysis
The robustness and utility of meta-analysis – and the reliability of any inferences drawn from it – are determined ultimately by the population of individual studies used. First, issues surrounding which studies can and should be included in a meta-analysis should be mentioned. Secondly, it would be useful to have some way of determining the likelihood of a significantly non-zero mean effect size being the result of a type I error and, conversely, the likelihood of a zero mean effect size being the result of a lack of statistical power rather than a reliable reflection of the true population mean effect size. The number and identity of studies used, as well as their individual sample sizes, will affect type I and II error rates in meta-analysis.
Which studies can be combined in a meta-analysis?
Step 2 in the ‘to do’ list reflects the fact that including methodologically poor studies in the data set may add more noise than signal, clouding our ability to calculate a robust mean effect size or to identify important moderator variables. Defining and reporting the criteria by which studies were assessed for inclusion is therefore an essential part of the meta-analytic method. Furthermore, thought must be given as to whether the studies under consideration may sensibly be combined in a meta-analysis – do the effect sizes calculated from the population of studies all reflect the same thing? For instance, both feeding offspring and providing thermoregulation by means of brooding or incubating are types of parental care, but in our meta-analysis (Harrison et al. 2009) we considered these two types of care separately. The effect sizes for the two types of care were significantly different as defined by a Q test, but more fundamentally there is no reason to assume that these behaviours have the same cost:benefit ratios for parents: therefore, we felt that combining their effect sizes would be an example of ‘comparing apples and oranges’ – a criticism that has often been levelled at meta-analysis. This consideration is probably more pertinent to ecologists than to, say, medical researchers, as response variables and study designs vary more widely in our field.
It is also worth noting that individual studies may act as outliers in a meta-analytic data set, having a very large influence on the mean effect size. It is possible to identify such studies by means of a leave-one-out analysis: each of our N studies is dropped from the data set in turn and a set of estimates of the mean effect size from the N - 1 remaining studies is calculated. Software such as the aforementioned MetaAnalyst can perform an automated leave-one-out analysis and so flag highly influential studies. How to deal with such a study is then a matter for personal consideration; depending on the nature of the study (sample size, experimental protocol, apparent methodological quality), the meta-analyst must decide whether it is justifiable to leave it in the data set, or better to remove it. If it is retained, then it would be advisable to report the effect of dropping this study on the conclusions.
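If your software does not automate it, a leave-one-out analysis is easy to sketch by hand: drop each study in turn and record the shift in the pooled (inverse-variance weighted) mean. The study values below are invented.

```python
def leave_one_out(effects, variances):
    """Shift in the pooled mean caused by dropping each study in turn.

    Large shifts flag influential studies that deserve closer scrutiny.
    """
    def pooled(es, vs):
        ws = [1.0 / v for v in vs]
        return sum(w * e for w, e in zip(ws, es)) / sum(ws)

    overall = pooled(effects, variances)
    shifts = []
    for i in range(len(effects)):
        es = effects[:i] + effects[i + 1:]
        vs = variances[:i] + variances[i + 1:]
        shifts.append(pooled(es, vs) - overall)
    return shifts

# The third study (small effect, low variance) is the most influential here.
shifts = leave_one_out([0.30, 0.45, 0.10], [0.04, 0.09, 0.02])
```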
We must also consider potential sources of non-independence in the data set. Non-independence has already been mentioned in the context of moderator variables such as research group, and in the context of phylogenetic meta-analysis. However, non-independence can also result from more than one effect being measured on each individual or replicate in a study. For example, if we have data on reproductive success and survival in control and experimental groups, then including the whole population of effect sizes in a single analysis not only raises the issue of a potential ‘apples and oranges’ comparison, but also creates non-independence as a result of measures from the same individuals being correlated. In this scenario, arguably the best strategy is to conduct separate meta-analyses of effects on reproduction and survival. Non-independence also rears its head in another form if we test the same set of studies over and over for the effects of different moderator variables. This will compromise the reliability of our significance tests and increase the type I error rate.
Publication bias
The biggest potential source of type I error in meta-analysis is probably publication bias. A funnel plot of effect size vs. study size is one method of identifying publication bias in our set of studies: all things being equal, we would expect that the effect sizes reported in a number of studies should be symmetrically distributed around the underlying true effect size, with more variation from this value in smaller studies than in larger ones. Asymmetry or gaps in the plot are suggestive of bias, most often due to studies which are smaller, non-significant or have an effect in the opposite direction from that expected having a lower chance of being published. A more thorough discussion of publication bias is provided by Sutton (2009). For the purposes of this article, suffice it to say that time spent uncovering unpublished data relevant to the hypothesis in question, as suggested in the ‘to do’ list above, is highly recommended.
Even if we discover and include some unpublished studies and produce a funnel plot with no glaring gaps, it would still be informative if we could work out the number of non-significant, unpublished studies that would have to exist, lying buried in file drawers and field notebooks, in order to make us suspect that our calculated mean effect size is the result of a type I error. This is termed the fail-safe sample size and various simple, back-of-an-envelope methods have been suggested for calculating it, based on the number of studies included, their effect sizes and some benchmark minimal meaningful effect size. The larger the fail-safe sample size, the more confident we can be about the representativeness of our data set and the robustness of any significant findings. However, Rosenberg (2005) makes the important point that suggested methods for calculating the fail-safe sample size are overly simple and likely to be misleading, in the main because they do not take into account the weighting of individual studies in the meta-analytic data set – a curious omission, given that weighting is one of the key strengths of meta-analysis. He outlines a method for calculating the fail-safe sample size which is arguably more explicitly ‘meta-analytic’ in its calculation.
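Rosenthal's original calculation, one of the simple unweighted methods that Rosenberg criticizes, illustrates the idea: combine the one-tailed Z-scores of the k studies and ask how many unpublished zero-effect studies would drag the combined Z below significance. The Z-values below are invented.

```python
def failsafe_n(z_values, z_alpha=1.645):
    """Rosenthal's (unweighted) fail-safe number.

    Number of unpublished studies averaging Z = 0 needed to bring the
    combined Stouffer Z below z_alpha (1.645 for one-tailed P = 0.05).
    Note: this ignores study weighting, the flaw Rosenberg (2005)
    highlights.
    """
    k = len(z_values)
    return (sum(z_values) ** 2) / (z_alpha ** 2) - k

n_fs = failsafe_n([2.0, 2.5, 1.8])
```

Here roughly a dozen buried null results would be enough to overturn significance, which for a three-study data set is not especially reassuring.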
The reader should therefore be aware that the utility of fail-safe sample size calculations is still debated. Jennions, Møller & Hunt (2004) and Møller & Jennions (2001) provide an interesting discussion of publication bias and type I errors in meta-analysis. These authors stress the point that meta-analysis involves (or should involve) explicit consideration of publication bias and attempts to minimize its influence, and that this should primarily consist of seeking unpublished studies (as opposed to post hoc calculations). If I may venture a tentative opinion, I would suggest that a report of fail-safe sample size is worth including in published meta-analyses, but it is no substitute for a thorough search for unpublished data and should be interpreted as only a rough reflection of the likely impact of any publication bias.
Power
As discussed above, type II errors often concern us more than type I errors. If our mean effect size is not significantly different from zero, if no significant heterogeneity is found among studies, or if a moderator variable is concluded to have no effect on effect size, how can we start to decide if this is simply due to a lack of statistical power? Evaluating the power of meta-analytic calculations is rather more complex than for a single study, as it depends on both the number of studies used and their individual sample sizes, which are related to the within-study component of variance in effect size. Hedges & Pigott (2001, 2004) provide detailed guides to power calculations for meta-analysis. In the present article, I will limit the discussion of power to the observation that small studies which in themselves have low statistical power might add more noise than signal to a meta-analytic data set and thus reduce its power: the benefits of excluding studies with very small sample sizes should be seriously considered, and can be quantified by calculating the power of a meta-analytic data set that either includes or excludes such studies.
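The logic behind Hedges & Pigott's power calculation for the pooled mean can be sketched under a fixed-effect model: assume a true effect, compute the variance of the pooled estimate, and ask how often the test would reject. This sketch ignores the negligible lower rejection tail and uses invented variances.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def meta_power(true_effect, variances, z_alpha=1.96):
    """Approximate power of the two-tailed test of the pooled mean effect.

    The pooled (fixed-effect) estimate has variance 1/sum(1/v_i); power
    is the chance its Z-score exceeds z_alpha given the assumed true
    mean effect (upper rejection tail only).
    """
    pooled_var = 1.0 / sum(1.0 / v for v in variances)
    lam = true_effect / math.sqrt(pooled_var)
    return 1.0 - normal_cdf(z_alpha - lam)

# Ten studies, each with sampling variance 0.04, true effect 0.2:
p10 = meta_power(0.2, [0.04] * 10)
# Doubling the number of studies raises power further.
p20 = meta_power(0.2, [0.04] * 20)
```

Running such a calculation with and without the smallest studies in the data set gives a concrete answer to the question of whether they are worth retaining.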
Closing remarks
Meta-analysis is a valuable tool for extracting as much information as possible from a set of empirical studies. The potential advantages of sharing and combining data in this way are, I hope, evident from the discussion in this article. Organizing and carrying out a meta-analysis is hard work, but the fruits of the meta-analyst's labour can be significant. In the best-case scenario, meta-analysis allows us to perform a relatively powerful test of a specific hypothesis and to draw quantitative conclusions. Even a low-powered analysis based on a small number of studies can provide useful insights (e.g. by revealing publication bias through a funnel plot). Finally, by revealing the magnitude of effect sizes associated with prior research, meta-analysis can suggest how future studies might best be designed to maximize their individual power.
Most journals now include in their instructions to authors a sentence to the effect that effect sizes should be given where appropriate, or that at least the information required for rapidly calculating an effect size should be provided. Nevertheless, this information is often missing, and its absence will not necessarily be noticed by the authors, interested readers or peer reviewers. For example, when conducting our meta-analysis on parental care (Harrison et al. 2009), it was only on specifically attempting to calculate effect sizes that we noticed a small number of published articles in which the sample sizes used were not clear. Double-checking that sample sizes are stated explicitly and that exact test statistics and P-values are reported should not add significantly to the burden of writing up a research article, and will add value to the work by allowing its ready incorporation into a meta-analysis if required. On a more positive note, we received many rapid and positive responses from colleagues whom we contacted to ask for clarification, extra data or unpublished data. There is clearly a spirit of cooperation in ecology and evolution which can lead to the production of useful and interesting syntheses of key issues in the field.
Box 1: Glossary
Effect size: A standardized measure of the response of a dependent variable to change in an independent variable; often but not always a response to experimental manipulation. Effect sizes could be thought of as P-values that have been corrected for sample size, and are the cornerstone of meta-analysis: they make statistical comparison of the results of different studies valid. Commonly used effect size measures are the standardized mean difference between control and experimental groups, correlation coefficients and response ratios.
Fail-safe sample size: If we calculate a mean effect size across studies and it is significantly different from zero, the fail-safe sample size is the number of unpublished studies with an effect size of zero that would have to exist in order to make our significant result likely to be due to sampling error rather than any real effect of the experimental treatment. That is, the bigger this value, the smaller the probability of a type I error. The utility of fail-safe sample sizes is debated.
Heterogeneity: A measure of the among-study variance in effect size, denoted Q. Just as ANOVA-type statistical analyses partition variance between defined independent variables and error to perform significance tests, meta-analysis can partition heterogeneity between independent variables of interest and error.
Meta-analysis: A formal statistical framework for comparing the results of a number of empirical studies that have tested, or can be used to test, the same hypothesis. Meta-analysis allows us to calculate the mean response to experimental treatment across studies and to discover key variables that may explain any inconsistencies in the results of different studies.
Null hypothesis significance testing: ‘Traditional’ statistical tests are tools for deciding whether an observed relationship between two or more variables is likely to be caused simply by sampling error. A test statistic is calculated based on the variance components of the data set and compared with a known frequency distribution to determine how often the observed patterns in the data set would arise by chance, given random sampling from a homogeneous population.
Power: The ability of a given test using a given data set to reject the null hypothesis (at a specified significance level) if it is false. That is, as power increases, the probability of making a type II error decreases.
Type I error: Rejecting the null hypothesis when it is true (see Failsafe sample size).
Type II error: Failing to reject the null hypothesis when it is false (see Power).
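One of the quantities defined above, the heterogeneity statistic Q, can be computed directly from a set of effect sizes and their sampling variances. The sketch below is illustrative only (the function name is mine):

```python
def heterogeneity_q(effects, variances):
    """Cochran's Q: the weighted sum of squared deviations of each
    study's effect size from the inverse-variance weighted mean effect.
    Under the null hypothesis of homogeneity, Q follows a chi-square
    distribution with k - 1 degrees of freedom."""
    weights = [1.0 / v for v in variances]
    mean = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - mean) ** 2 for w, d in zip(weights, effects))
    return q, len(effects) - 1
```

Identical effect sizes give Q = 0, while increasingly discordant studies inflate Q; partitioning Q between the levels of a moderator variable then proceeds analogously to partitioning sums of squares in ANOVA.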
Box 2: The power of metaanalysis
Imagine that a novel genetic polymorphism has been discovered in a species of mammal. It has been hypothesized that the ‘mutant’ genotype may affect female lifetime reproductive success (LRS) relative to the wild type. Twelve groups of researchers genotype a number of females and record their LRS. Each group studies equal numbers of wild-type and mutant females, with total sample sizes ranging from 18 to 32 animals. Six of the studies were carried out on one long-term study population in habitat A and six on a second in habitat B.
Unknown to the researchers, there is a habitat-dependent effect of genotype on female LRS. Across the whole species wild-type females produce on average 5·0 ± 2·0 offspring that survive to reproductive age. In habitat A, mutant females also produce 5·0 ± 2·0 offspring, but in habitat B mutant LRS is increased to 5·8 ± 2·0 offspring. The standardized mean difference in female LRS is, therefore, zero in habitat A and 0·4 in habitat B.
The results of the imaginary studies are given in Table 1 and are based on random sampling from normal distributions with the specified means and standard deviations. For each study, LRS (mean and SD) is given for each genotype, along with the sample sizes and the P-value resulting from a t-test. Based on t-tests, three studies reported a significant effect of the mutant allele on LRS.
Table 1. Results of 12 studies investigating the effect of genotype on LRS

Study  Habitat  Wild-type LRS         Mutant LRS            P (two-tailed)
                Mean   SD     N       Mean   SD     N
1      A        4·71   1·55   10      5·28   2·74   10      0·575
2      A        4·5    1·51   12      5      2·23   12      0·470
3      A        5·12   1·94   14      4·79   2·1    14      0·671
4      A        4·92   2·01   11      4·68   1·35   11      0·745
5      A        4·93   1·65   15      4·19   2·24   15      0·312
6      A        5·08   2·36   16      5·11   1·58   16      2·122
7      B        4·71   1·15   12      5·94   1·66   12      0·047
8      B        4·9    1·7    10      6·01   1·22   10      0·110
9      B        5·1    2·05   9       5·83   1·81   9       0·257
10     B        4·77   1·78   16      6·99   1·7    16      0·001
11     B        3·92   2·55   14      5·84   1·94   14      0·034
12     B        4·99   1·6    15      5·72   1·66   15      0·229
Can we use meta-analytic techniques to combine these data and gain quantitative estimates for the size of the effect of genotype on LRS? Fig. 1 shows the calculated mean effect size (Cohen's d) for each study, with their 95% confidence intervals. The 95% confidence interval for the weighted mean effect size across all twelve studies is (0·06, 0·64), suggesting that the mutation does indeed increase LRS. Furthermore, if we treat habitat as a moderator variable, the genotype by environment interaction is revealed: the 95% confidence interval for the mean effect is (−0·35, 0·29) in habitat A and (0·40, 1·1) in habitat B. Thus the mean effect size is not significantly different from zero in habitat A, but positive in habitat B. Also, the confidence interval for habitat B (just) captures the ‘true’ effect size of 0·4.
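The calculation behind this example can be reproduced approximately from the data in Table 1. The sketch below uses Cohen's d with its usual large-sample sampling variance and inverse-variance weighting under a fixed-effect model; the intervals it produces are close to, but not necessarily identical with, those quoted above, since small-sample corrections and weighting schemes differ between implementations.

```python
def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference (mutant minus wild type) and its
    approximate large-sample sampling variance."""
    pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    d = (m2 - m1) / pooled
    var = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    return d, var

def weighted_mean_ci(ds):
    """Inverse-variance weighted mean effect size with a 95% CI."""
    weights = [1.0 / v for _, v in ds]
    mean = sum(w * d for w, (d, _) in zip(weights, ds)) / sum(weights)
    se = (1.0 / sum(weights)) ** 0.5
    return mean, (mean - 1.96 * se, mean + 1.96 * se)

# (wild-type mean, SD, N, mutant mean, SD, N) for each study in Table 1
habitat_a = [(4.71, 1.55, 10, 5.28, 2.74, 10), (4.50, 1.51, 12, 5.00, 2.23, 12),
             (5.12, 1.94, 14, 4.79, 2.10, 14), (4.92, 2.01, 11, 4.68, 1.35, 11),
             (4.93, 1.65, 15, 4.19, 2.24, 15), (5.08, 2.36, 16, 5.11, 1.58, 16)]
habitat_b = [(4.71, 1.15, 12, 5.94, 1.66, 12), (4.90, 1.70, 10, 6.01, 1.22, 10),
             (5.10, 2.05, 9, 5.83, 1.81, 9), (4.77, 1.78, 16, 6.99, 1.70, 16),
             (3.92, 2.55, 14, 5.84, 1.94, 14), (4.99, 1.60, 15, 5.72, 1.66, 15)]

overall, ci_overall = weighted_mean_ci([cohens_d(*s) for s in habitat_a + habitat_b])
mean_a, ci_a = weighted_mean_ci([cohens_d(*s) for s in habitat_a])
mean_b, ci_b = weighted_mean_ci([cohens_d(*s) for s in habitat_b])
```

Running this yields an overall weighted mean of roughly 0·35 with a confidence interval excluding zero, a habitat A interval spanning zero and a habitat B interval lying wholly above zero, mirroring the pattern described above.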
This example should serve to demonstrate that meta-analysis is a powerful way of synthesizing data and effectively increasing sample size to provide a more robust test of a hypothesis. However, like all statistical methods, the results of meta-analysis should be interpreted in the light of various checks and balances which can inform us as to the likely reliability of our conclusions: this is discussed in the main text.