How many subjects do we need in a study, and what is a good number to have? If statisticians received £1 or €1 or $1 each time they were asked such a question, they would be rich (a more guaranteed income than playing any lottery or game of chance). This simple question is sometimes described as sample size or power calculations, and we will introduce a small piece of terminology in our discussion: the members of the population of interest will be described as experimental units or subjects, while the sample will be the set of subjects or experimental units that we select.
“How many?” is a deceptively simple question, to which you might hope there was a stunningly simple answer. Sadly not; for this one question, the professional statistician is going to ask possibly as many as four further questions, some of which may not be so simple to answer.
However, this deceptively simple question and its answer are crucial to the successful design of your experimental studies (regardless of whether any change, difference, or improvement is ultimately found to be statistically significant). They may also be key to receiving ethical approval and, indeed, funding.
In our discussion of how we can answer this question, we will meet power and significance, both of which we introduced in the previous article on hypothesis testing (Scott and others 2012). The other quantity that we will encounter is the estimated standard error (ese) since, depending on the context of the problem, we may wish to consider how many experimental units (subjects) are needed to estimate a population quantity with a specified precision (so more akin to the confidence interval idea also met in that earlier article).
We will re-discover how important variability is and what its implications are for answering the question. Finally, we will consider what options might exist if the answer is impractical and infeasible.
Commonly the objectives of our designed experiment are phrased in terms of properties of the population, in statements such as “is the population mean percentage body weight different from 15%?”, “does drug A significantly lower blood pressure?”, “is there a difference between the mean heart rate for treatments A and B?”
As illustrations, we will use examples in which the question is posed in terms of the number of experimental units (subjects) needed to estimate a population mean with a specified precision, or to detect a difference of a given (clinically relevant) size between two population means. But before we begin, let us refresh our memories about power and estimated standard errors – key components in our sample size calculations.
In the earlier discussion of hypothesis testing (Scott and others 2012), we mentioned two types of error: type 1 error, where we reject the null hypothesis when we should not, the probability of which is controlled by the significance level of the test, and type 2 error, where we do not reject the null hypothesis when we should. This latter error is related to the power of a test, which is (1-probability of making a type 2 error). The power is then the probability of rejecting the null hypothesis when we should, so clearly we want power to be as high as possible. Conventionally, we might hope to achieve a power of 80% or higher.
Unfortunately, power and the probability of a type 1 error are related, and it is generally not possible to simultaneously maximise the power while minimising the probability of a type 1 error. We compromise, and the convention is that we control the probability of a type 1 error (at no more than 0·05) and then find the procedure which gives the best power. Power is also affected by several other factors, as we will see later.
Estimated Standard Error And Precision
In the previous article (Scott and others 2012), when examining data from the Bell and others (2011) article, we saw how heart rate varied amongst study subjects (even before they were assigned to any treatment group). This can be quantified in terms of the standard deviation (sd) or variance. However, when we consider estimating, say, the mean heart rate, what good is an estimate without a measure of uncertainty? Enter the estimated standard error (ese), sometimes also called the standard error of the mean.
Precision is concerned with how little or how much uncertainty there is associated with our estimate, so you may, in an article, see statistics quoted as mean heart rate (bpm) of 118 ±1·65. But consider carefully whether 1·65 is the sd or the estimated standard error, and whether the article makes this clear. There is a critical distinction.
If the estimated standard error is being quoted, then 1·65 is a measure of precision; it tells us how well we know the mean value – a large estimated standard error compared with a small estimated standard error is the difference between being imprecise and precise. If imprecise, then there is considerable variability. In another sense, if we constructed a 95% confidence interval for the population mean, then the width of the confidence interval is a measure of precision – wide intervals are imprecise.
If the ± is the sd, then 1·65 tells us how much heart rate varies across the study subjects. Standard deviation and estimated standard error are related: the estimated standard error is the sd divided by the square root of n, the number of subjects.
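The relationship between sd and estimated standard error can be sketched in a few lines of Python. The numbers used here (an sd of 10 bpm and 25 subjects) are hypothetical and chosen only for illustration, and the function name is our own; the interval uses the usual large-sample multiplier of 1·96.

```python
import math

def estimated_standard_error(sd, n):
    """Estimated standard error of the mean: sd divided by sqrt(n)."""
    return sd / math.sqrt(n)

# Hypothetical values for illustration only (not taken from any study).
mean_hr, sd_hr, n = 118.0, 10.0, 25

ese = estimated_standard_error(sd_hr, n)

# Approximate 95% confidence interval: mean +/- 1.96 * ese.
ci = (mean_hr - 1.96 * ese, mean_hr + 1.96 * ese)

print(f"sd = {sd_hr} bpm describes the spread between subjects")
print(f"ese = {ese} bpm describes the precision of the mean")
print(f"approximate 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

With these assumed values the sd is five times the estimated standard error, which is why it matters which of the two a quoted ± refers to.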
How Large A Sample Is Required?
Now to the matter in hand: if you ask a statistician “How large a sample is required?”, you need to be prepared to answer four questions. The problem is that the answer depends on a series of matters related to your experiment. The sample size might be viewed as another way of saying “how much information do I need?” So, first let us consider the ingredients needed to answer this question, by asking this further series of questions that shed light on the “How many?” question:
(a) How big is the effect that you are looking for likely to be? To put it another way, how much of a difference would be of real world importance (clinical significance) and, hence, is important to be able to detect? The amount of information needed will depend on how big the effect that you are looking for actually is. So, if you are looking for a tiny effect, you will need to look at a lot of individuals in order to see it – you need to “magnify” the effect by examining a large number of them. If, on the other hand, the effect is very large, then it will be obvious when you examine just a few. This is about what size of effect would be of clinical relevance or genuinely makes a difference – i.e. clearly a clinical judgement on your part and not something that a statistician can be much help with, except to ask the question.
(b) How variable is the quantity that you are measuring likely to be in the population? If it is very variable, you will need to look at a lot of individuals to get a reliable estimate of the effect you are looking for, and if it does not vary much, a small number of individuals will do. In most cases, someone, somewhere, will have done something on similar animals and will have published a measure of the variability as part of their study. If what you are doing is so novel that there is nothing published that is remotely similar, then you might need to do a small pilot study to estimate this. What we need is a ballpark figure as to roughly how variable the individuals that you are dealing with are in the quantity(ies) concerned.
(c) How sure do you want to be that, if the effect is really there, you will find it? The more sure that you want to be, the more effort (and thus the larger the sample size) that you will have to put in. There is no such thing as certainty in this, and the more sure you want to be, the more it will cost you, and there is a law of diminishing returns in this. This is the power of the experiment, often formally written as (1-β), and there needs to be a reasonable trade-off between finding things out and having to have an impossibly large sample size. Setting the power at 80% means that we have a probability of 0·8 of finding the effect if it is there. Increasing the power will require more subjects.
(d) How much risk are you prepared to take that you will think that there is an effect, when in fact what you have seen is simply the result of chance? The more sure you want to be in your answer (i.e. the less likely to draw an incorrect conclusion), the more measurements you will need. In essence this is about how much you value your reputation, since if you publish a result that was simply produced as a result of chance, it will not subsequently be validated in other people's work. Here, the answer depends on the consequences of saying that something causes an effect when in fact it does not. In ordinary science this is set at 5% and is what we have referred to in our earlier articles as the significance level (Scott and others 2012), often formally written as α. If, however, you are researching a potential treatment that will cause a great deal of distress and, therefore, from an ethical perspective you want to be rather more sure that it is really having the effect that you think that it is, you might set the level at 1% or even 0·1%. Reducing the significance level will require more subjects.
The problem is not as bad as it may appear. There are values that are routinely chosen for questions (c) and (d) and once you have answers to (a) – (d) above, you can calculate how large a sample is required. One helpful way to think of these questions is as a “thought experiment” before you carry out the actual experiment. And, if you have answers to any four of the five questions, you can answer the fifth, so that for instance, if you know (b) – (d) and how many experimental units you have, then you can work out the size of effect you could detect. Alternatively, if you know how many, and answers to (a), (b) and (c), then you can work out the power of being able to detect the effect if it is present. This all sounds complex but hopefully with a few examples, we can illustrate the concepts.
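To make the "any four give the fifth" idea concrete, here is a minimal sketch that works out the smallest detectable difference when the number of subjects is fixed. It assumes a comparison of two equal-sized groups, a two-sided test, and a normal approximation; the function name and the input values are hypothetical and are not taken from this article.

```python
import math
from statistics import NormalDist

def detectable_difference(n_per_group, sd, alpha=0.05, power=0.80):
    """Smallest difference between two group means that can be detected
    with the given power (two-sided test, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # from question (d)
    z_beta = NormalDist().inv_cdf(power)           # from question (c)
    # Standard error of a difference between two means of equal-sized groups.
    se_difference = sd * math.sqrt(2 / n_per_group)
    return (z_alpha + z_beta) * se_difference

# Hypothetical inputs: sd of 10 units, with 25 or 100 subjects per group.
for n in (25, 100):
    print(n, round(detectable_difference(n, sd=10.0), 2))
```

Under these assumptions, quadrupling the number of subjects per group only halves the detectable difference, which is one face of the law of diminishing returns mentioned in (c).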
Example 1: Sample sizes in the 1-group case
Imagine that we wish to estimate the population mean weight in healthy, 2 to 4-year-old Beagles. Figure 1 below shows 95% confidence intervals for the population mean weight based on a sample size of n=10, n=20, n=50, n=100 and n=1000 dogs. Of course, these results are simulated from a computer model, which specified that weight was normally distributed with mean 15 and sd of 1, but what can we notice about the intervals? They are clearly different, and as n increases the width of the interval decreases, until with n=1000 we have a very narrow interval. The summary statistics for each data set are described in Table 1.
We can see that the means are all different, ranging from approximately 14·37 to 15·01, and the sd range from approximately 0·77 to 1·10 – so, different, but also quite similar. Now study the estimated standard error (se mean): it has decreased from approximately 0·28 to 0·03, virtually a 10-fold reduction going from n=10 to n=1000.
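The roughly 10-fold reduction follows directly from the square root in the formula. The sketch below tabulates the theoretical standard error for the sample sizes used in the simulation, assuming the model sd of 1; the values in Table 1 differ a little because they use the sd estimated from each simulated sample.

```python
import math

MODEL_SD = 1.0  # sd specified in the computer model described above

def standard_error(sd, n):
    """Standard error of the mean for a sample of size n."""
    return sd / math.sqrt(n)

sample_sizes = [10, 20, 50, 100, 1000]
for n in sample_sizes:
    print(f"n = {n:5d}  standard error = {standard_error(MODEL_SD, n):.3f}")

# Going from n = 10 to n = 1000 multiplies n by 100,
# so the standard error shrinks by a factor of sqrt(100) = 10.
ratio = standard_error(MODEL_SD, 10) / standard_error(MODEL_SD, 1000)
print(f"reduction factor = {ratio:.1f}")
```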
Now imagine that the question was posed: “How many dogs would be required to estimate the population mean with a 95% confidence interval of a certain width (suppose we chose a value of 0·196)?” The width of a 95% confidence interval is roughly 4 times the estimated standard error (more precisely, 2 × 1·96 = 3·92 times), so the question is equivalent to asking how many dogs are needed to achieve an estimated standard error of 0·05 (i.e. 0·196/3·92). This is the first example of sample size calculations (Table 2).
|From Table 1, common sense would suggest that for an ese of 0·05, we would need n somewhere between 100 (ese of 0·08) and 1000 (ese of 0·03). To calculate n, we need to remember that the estimated standard error is the population sd divided by the square root of n. For our Beagle population, the sd is 1 (stated above, taken from our computer model), so 0·05 = 1/√n, which gives √n = 1/0·05 = 20 and n = 20² = 400,|
|i.e. we would need to recruit 400 Beagles to this study group.|
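The same calculation can be written as a short function. This is a sketch under the assumptions of Example 1 (a known population sd and a normal-based 95% interval); the function name is our own.

```python
import math
from statistics import NormalDist

def n_for_ci_width(sd, width, confidence=0.95):
    """Number of subjects needed so that the confidence interval for the
    mean has (at most) the requested total width, assuming a known sd."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    # width = 2 * z * sd / sqrt(n), rearranged for n and rounded up.
    return math.ceil((2 * z * sd / width) ** 2)

# Example 1: sd = 1 (from the computer model), interval width = 0.196.
print(n_for_ci_width(sd=1.0, width=0.196))  # 400
```

Halving the desired width (to 0·098) would quadruple the requirement to about 1600 dogs, since n depends on the square of sd/width.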
In the more common type of problem, say where we are comparing two different treatments, it is useful to introduce a new tool – the power curve. Power curves plot every combination of power versus size of difference for specified test conditions (size of α, sd). They are used to help determine the appropriate sample size or power for your test.
The easiest way to demonstrate power curves is to actually explore a couple of examples – the first is completely hypothetical (example 2 below) and rather absurd, but we have done this to emphasise the importance of thinking about power, while the subsequent example (example 3) is more realistic.
Power And Sample Size Calculations For A Difference (2-Group Case)
As a second example, suppose that we want to compare two treatments and their effect on mean blood pressure (BP) in dogs. This study design requires two groups of dogs: one group receiving treatment A, one group receiving treatment B. We will also assume that the same number of dogs is recruited to each treatment, so how many dogs do we need to recruit in total? Table 3 shows the thought process to identify the necessary components to answer this question.
|First identify the question: what size of difference in mean BP is important to detect?|
|In our example, let us imagine that the mean difference in BP between the two treatment groups will be 4·8 mmHg.|
|What is the sd of the population? To answer this, we might need a pilot study or we may be able to get an answer from the literature. Again, let us imagine that the variability of BP in the population of dogs is, say, 1 sigma (σ)=47.|
|We can immediately conclude that the true treatment effect is relatively modest, and the variability in the population is large, so using a small sample size for each group is unlikely to find this effect.|
|Take α to be 0·05, and imagine that we have carried out a small study with 14 subjects receiving treatment A and 14 subjects receiving treatment B.|
|What is the probability of detecting the difference of 4·8 mmHg (in other words what is the power of the study)?|
Figure 2 shows a power curve, with a sample size of 14, a difference of 4·8, and a sd of 47. The dot on the curve identifies the power, which in this case is approximately 0·085 (on the vertical axis). This value is so low that the experiment is not worth doing.
If we use the same experimental situation, but this time stipulate that we require the power to be 90%, with σ=47 and α=0·05, what should n be? Figure 3 shows the new curve, and this time n=2016 in each group (which is a very large number of subjects).
Finally, what if we could reduce the variability by a factor of 4, so that σ=11·75, while keeping everything else the same? Figure 4 shows that, to find a difference of 4·8 with 90% power, we would need n=127 in each group (so a total sample size of 127+127=254).
Although not shown, with this variability reduction, if we could still only manage n=14 subjects in each group, the power would increase substantially to 0·18 instead of approximately 0·085 (nearly a 20% chance of detecting the difference rather than under 10%).
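The sample sizes read from the power curves can be approximated by hand. The sketch below uses the standard normal-approximation formula for comparing two means with a two-sided test; that the figures were produced with a two-sided t-based calculation is our assumption, and the function name is our own. The approximation gives 2015 and 126 per group, within one subject of the 2016 and 127 quoted above, as expected when the t distribution is used instead.

```python
import math
from statistics import NormalDist

def n_per_group(difference, sd, alpha=0.05, power=0.90):
    """Approximate subjects per group needed to detect a difference between
    two means (two-sided test, equal group sizes, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) * sd / difference) ** 2
    return math.ceil(n)

# Example 2: difference of 4.8 mmHg, 90% power, alpha = 0.05.
print(n_per_group(4.8, sd=47.0))   # 2015 (article: 2016)
print(n_per_group(4.8, sd=11.75))  # 126  (article: 127)
```

Because the sd enters the formula squared, reducing the variability by a factor of 4 reduces the required sample size by a factor of about 16.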
The third example will be based on a rather common type of study (and using more realistic values), involving a comparison of the cardiovascular effects of two intravenous anaesthetic induction drugs, and where subjects have their heart rate and blood pressure measured at baseline and postinduction. It is hypothesised that, if the mean heart rate on drug B is 10 bpm lower than the mean heart rate on drug A postinduction, then this is clinically significant. How many subjects are needed to have an 80% power of detecting such a difference? On carrying out a pilot study with the two drugs, we obtain the results in Table 4.
|Group||Baseline blood pressure (mmHg)||Baseline heart rate (bpm)||Postinduction blood pressure (mmHg)||Postinduction heart rate (bpm)|
|Drug A||92 (sd 10)||110 (sd 24)||65 (sd 22)||122 (sd 22)|
|Drug B||84 (sd 18)||92 (sd 17)||73 (sd 16)||118 (sd 16)|
We have decided that a difference of 10 bpm is clinically significant, and that we wish to have a power of 80%, but we still need to consider the significance level of the test and the population sd. This last quantity we can estimate from the pilot study results in Table 4. Using the baseline heart rate figures, we estimate the population sd to be 20·8. This is called the pooled sd and is calculated as a combination (a weighted average) of the sd of 24 and 17 in the table above (the precise mechanism by which this figure is reached is not necessary here). The only information taken from the pilot study concerns the variability in heart rate. One question might be: how large should the pilot study be? The main role of the pilot study here has been to identify the variability; only a rough estimate is needed, so between 5 and 10 animals per group is usually sufficient.
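For readers who do want to see the mechanism, the usual pooled sd formula is sketched below. The pilot group sizes are not stated in the text, so equal groups of 8 animals are assumed here purely for illustration; with equal group sizes the result does not depend on the size chosen.

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation: square root of the weighted average of
    the two variances, weighted by degrees of freedom (n - 1)."""
    pooled_variance = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return math.sqrt(pooled_variance)

# Baseline heart rate sds from Table 4: 24 (drug A) and 17 (drug B).
# Group sizes of 8 are an assumption; they are not given in the text.
print(round(pooled_sd(24, 8, 17, 8), 1))  # 20.8
```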
Figure 5 shows the resulting power curve based on the pilot study data from Table 4. This curve looks a little different from the earlier figures, and the reason for this is that we stipulated in our question that drug B lowered heart rate by 10 bpm (so this would be a one-sided question) compared with drug A.
The result is that we need a sample size of 55 dogs in each treatment group.
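As a rough check, the one-sided version of the normal-approximation formula can be applied to this example. This is a sketch with a function name of our own; it gives 54 per group, one fewer than the 55 quoted, which is the size of discrepancy expected if the quoted figure comes from a t-based calculation (our assumption). The two-sided equivalent is shown for comparison.

```python
import math
from statistics import NormalDist

def n_per_group(difference, sd, alpha=0.05, power=0.80, one_sided=False):
    """Approximate subjects per group for comparing two means
    (equal group sizes, normal approximation)."""
    tail = alpha if one_sided else alpha / 2
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / difference) ** 2)

# Example 3: difference of 10 bpm, pooled sd 20.8, 80% power, alpha = 0.05.
print(n_per_group(10, sd=20.8, one_sided=True))   # 54 (article: 55)
print(n_per_group(10, sd=20.8, one_sided=False))  # 68
```

The comparison shows how much the choice of a one-sided question matters: under the same assumptions, a two-sided test would need roughly 14 more dogs in each group.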