Example 1: two independent samples, comparison of the population means
A two-sample t test: Using the study of Mosing and others (2010), we first consider a specific type of hypothesis test, namely, a two-sample t test, used in this case since we have two independent groups of nine cats. Although the study authors compared a number of variables between the two groups at various timepoints in the experiment, for simplicity, we will just look at the baseline heart rates immediately before the start of surgery. At this stage, all of the cats were anaesthetised, nine had received brachial plexus nerve blocks (BPB), while the other nine had not. A t test on the data from this timepoint is used to determine if there are differences between the groups even before the surgery started, as this would have implications for subsequent differences observed in heart rates between the two groups.
Null hypothesis: There is no difference in baseline mean heart rate between cats having BPB and cats not receiving BPB.
Alternative hypothesis: There is a difference in baseline mean heart rate between cats having BPB and cats not receiving BPB.
In this case, the test statistic uses (quite naturally) the observed difference between the two sample means (133 beats/min from the cats with no block and 127 beats/min from the cats that received BPB), and their standard deviations (19 and 17, respectively).
Simply put, the question is whether a difference of 6 (133 – 127) is sufficiently different from 0 that we can say the result is unlikely to have occurred by chance if the mean heart rates in the populations of cats were really the same. We can only answer this question by taking account of the variability in the samples, because this experiment looked at just one sample of cats with no block and one sample of cats that received the block, out of all the possible samples of cats that might have taken part in the experiment: just one of the many possible sets of 18 cats that we could have observed.
Table 2 shows the statistical output for a two-sample t test for our example. In the first section, it shows the summary statistics – mean, standard deviation and the standard error of the mean (SEM) [the SEM is the standard deviation divided by the square root of n, where n is the number of observations in each group (so the SEMs in this case are 19/3 and 17/3)]. The second section shows the difference between the means of the two treatments in our samples, namely, 6. It also shows the test statistic (t value), which is 0·71, and the P value for that test statistic, which is 0·49. With a P value of 0·49, the chance of observing this amount of difference when there is in fact no real difference between the two treatments is about 50% (or 49% to be precise). We should note that the variability is quite large (SEMs of 6·3 and 5·7) in comparison to the difference in mean heart rate (6·0), and so the difference is quite likely to have occurred by chance, even if the treatment had no effect. The authors can then conclude that there is no statistically significant difference between the mean heart rates for the two treatment groups.
Table 2. Typical output from the two-sample t test
|Section 1|
| ||n||Mean||SD||SEM|
|No block||9||133||19||6·3|
|BPB||9||127||17||5·7|
|Section 2|
|Difference=mu (1)–mu (2)|
|Estimate for difference: 6·00|
|t test of difference=0 (versus not=): t value=0·71 P value=0·490 DF=16|
|Both use pooled SD=18·0278|
What about the other pieces of information: DF=16 and the fact that a pooled standard deviation was used? Are these important in our discussion? DF stands for degrees of freedom, a way of expressing how many independent pieces of information contribute to a calculated statistic; in this case it is 9+9–2, that is, the total number of observations less 2 (since we are estimating the two population mean values), but this is rather a technical point and one which we can ignore. The fact that the test has used the pooled standard deviation is more important, as it relates to an assumption made for this particular version of the two-sample t test. The two-sample t test assumes that heart rate is normally distributed; the normal distribution is specified in terms of the mean (or average value) and the variance (or variability) of the individuals, and in this case we assume that the variance is the same in the two treatment groups. This common variance is then estimated by calculating a pooled variance, formed as a weighted average of the two separate sample variances. Too much detail? Perhaps, but it is always good to be aware of the assumptions made in your statistical analysis.
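As an illustration, the pooled standard deviation and t value reported in Table 2 can be reproduced from the summary statistics alone. The following is a minimal sketch in Python, using only the group sizes, means and standard deviations given above:

```python
import math

# Summary statistics from Table 2 (baseline heart rates, beats/min)
n1, mean1, sd1 = 9, 133.0, 19.0   # cats with no block
n2, mean2, sd2 = 9, 127.0, 17.0   # cats with BPB

# Pooled variance: a weighted average of the two sample variances,
# with weights given by the degrees of freedom in each group
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
pooled_sd = math.sqrt(pooled_var)

# Standard error of the difference between the two sample means
se_diff = pooled_sd * math.sqrt(1 / n1 + 1 / n2)

# t statistic: observed difference divided by its standard error
t = (mean1 - mean2) / se_diff

print(round(pooled_sd, 4))  # 18.0278, as in Table 2
print(round(t, 2))          # 0.71, as in Table 2
```

Obtaining the P value of 0·49 additionally requires the cumulative distribution function of the t distribution with 16 degrees of freedom, which is not in the Python standard library; statistical packages report it automatically.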
A confidence interval for the difference between two population means: Continuing our analysis, an alternative approach would be to construct a confidence interval for the difference between the two population means. We will use the same information as we did for the t test, but in a slightly different way.
Remember that we explained that a common approximate form of a 95% confidence interval is

estimate ± 2 × estimated standard error of the estimate
The parameter of interest here is the difference in heart rates between the two treatments, so our estimate of the population mean difference in heart rate is 6, and next we need its estimated standard error. There are two approaches to the estimated standard error; the more common one is to assume that the variability in the two groups is the same (as we did in the hypothesis test above, where we used the pooled standard deviation), since only then are the probability assumptions satisfied.
What about the estimated standard error of that difference? This is calculated using the standard error of the mean for each group. Numerically, this results in an estimated standard error of the difference of 8·498. Tables 3 and 4 explain this in more detail.
Table 3. Calculation of the estimated standard error of the difference between two population means
Table 4. Calculation of the confidence interval for the difference between two population means
|Standard error of the difference=pooled SD×√(1/n1+1/n2)=18·0278×√(1/9+1/9)=8·498|
|95% confidence interval: 6·00±(2×8·498)=(–11·0, 23·0)|
So the difference in population mean heart rate could plausibly lie somewhere between –11 and 23. The important thing is that this interval includes the value 0, so the two population mean heart rates (for cats with BPB and cats without BPB) are plausibly the same.
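Putting the pieces together, the approximate interval quoted above can be checked with a few lines of Python, using the values from Tables 3 and 4:

```python
# Estimated difference in mean heart rate and its standard error (Tables 3 and 4)
diff = 6.0
se_diff = 8.498

# Approximate 95% confidence interval: estimate +/- 2 x standard error
lower = diff - 2 * se_diff
upper = diff + 2 * se_diff

print(round(lower), round(upper))  # -11 23
```

Because the interval comfortably straddles 0, the conclusion does not change if the exact t critical value for 16 degrees of freedom is used in place of the approximate multiplier of 2.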
The conclusions from both the hypothesis test and the confidence interval are the same, but if given a choice, we would recommend that you use confidence intervals. Why? Suppose you are in a hypothetical situation similar to the one above, but the hypothesis test leads you to reject the null hypothesis, so your conclusion would be that there is a statistically significant difference in the population mean heart rate. Would the next obvious question not be, how much of a difference, or is treatment 1 mean heart rate higher or lower than that of treatment 2? To answer these follow-on questions, we need a confidence interval, so why not use it immediately?
Let us extend this section to consider another common type of question.
Example 2: paired data
The next experimental situation we could consider is that of paired data. In the first instance, consider a situation where the same individual has the same attribute (e.g. heart rate) measured on two separate occasions (e.g. before and after a treatment). Here, the questions of interest typically concern the change, if any, which has occurred as a result of the treatment. In this case, we assume that the paired observations are related, so we cannot treat the problem as though we had two independent samples as we did when we considered the two individual groups in the study by Mosing and others (2010).
The question of interest concerns evidence for a difference in the population mean values before and after. The simple and elegant solution is to calculate the difference between the two measurements of heart rate for the same individual and then analyse those differences. This leads to a one-sample t test.
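To make the idea concrete, here is a minimal sketch in Python using invented heart rates for five hypothetical dogs (these are not the study data): the paired problem reduces to a one-sample analysis of the within-dog differences.

```python
import math
import statistics

# Hypothetical paired heart rates (beats/min) for five dogs, invented purely
# for illustration -- these are NOT the data from Bell and others (2011)
before = [120, 110, 135, 128, 117]
after = [115, 112, 130, 120, 118]

# Reduce the paired problem to a single sample of within-dog differences
diffs = [b - a for b, a in zip(before, after)]

# One-sample t statistic for the null hypothesis: mean difference = 0
n = len(diffs)
mean_d = statistics.mean(diffs)
sem_d = statistics.stdev(diffs) / math.sqrt(n)
t = mean_d / sem_d

print(diffs)        # [5, -2, 5, 8, -1]
print(round(t, 2))  # 1.56
```

Note that each dog acts as its own control: the between-dog variability cancels out in the subtraction, which is exactly why the paired analysis differs from treating the two columns as independent samples.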
So let us return to the study by Bell and others (2011) and consider one group of dogs and the measurement of heart rate at two different timepoints in the experiment (Fig 3).
How has heart rate changed from timepoint 1 to timepoint 2?
Each dot represents an individual dog, and we can compare heart rate measured at the two timepoints on the same dog (only 18 dots are visible as there are three dogs with a heart rate of 80/min before and after treatment, and these dots are superimposed on the graph). The line represents the line of equality. Points that lie on this line would represent dogs for which the heart rates at the two timepoints were the same. This type of scatterplot, where the scales on the horizontal and vertical axes are the same and the line of equality has been included, is very good for paired data experiments.
Has heart rate changed between the two timepoints? In Fig 3, it looks as though there are more points above the line than below, so that would suggest that heart rate at timepoint 1 was higher than that at timepoint 2 for the majority of dogs, which might suggest that heart rate has decreased. Let us calculate the differences for each individual dog and see what they show (Fig 4).
Table 5 and Fig 4 summarise the differences. But we need to know how the difference was calculated before we can discuss its interpretation. In this case, the difference was calculated as timepoint 1 heart rate minus timepoint 2 heart rate. The mean difference is 1·70, so on average, timepoint 1 heart rates were 1·7 beats/min higher (which is quite small), and more importantly, the standard error was 2·77 (which is greater than the mean difference). If the variability is relatively large, different samples taken from the population will have quite different means. This is not a problem if the difference between treatments is large, but if the difference is small, there is a high risk that a single sample from each treatment will produce a difference that is not representative of the real difference produced by the treatment. So, informally, there is no evidence of a change in mean heart rate. However, as statisticians, we need to investigate formally.
Table 5. Summary statistics for difference in heart rate between the two timepoints
|Descriptive statistics: difference|
|Difference in mean HR||n||n*||Mean||SEM||SD||Minimum||Q1||Median||Q3|
We can again use a hypothesis test approach or a confidence interval approach.
For the hypothesis test, the null hypothesis would stipulate that the two population means are the same, while the alternative would stipulate they are different. Table 6 presents both the test and confidence interval results.
Table 6. Summary of hypothesis test and confidence interval
|Difference in mean HR||n||Mean||SD||SEM||95% CI||T||P|
| ||20||1·70||12·39||2·77||(–4·10, 7·50)||0·61||0·547|
So first let us focus on the test: the test statistic is 0·61 and the P value is 0·547, so we would see a test statistic as large as this nearly 55% of the time purely by chance. We therefore conclude that we cannot reject the null hypothesis (because of the large P value which is far greater than the nominal 0·05 significance level that is frequently used). From this, we conclude that it is plausible that the mean heart rates are the same at the two different timepoints. The 95% confidence interval for the mean difference is –4·1 to 7·5 and as this interval includes 0, it is entirely plausible that there is no difference in the population mean heart rates.
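The figures in Table 6 can be reproduced from the reported summary statistics. In this Python sketch, 2·093 is the t critical value (97·5th percentile of the t distribution with 19 degrees of freedom), which gives the exact 95% interval rather than the approximate multiplier of 2:

```python
import math

# Summary statistics from Table 6
# (difference = timepoint 1 heart rate minus timepoint 2 heart rate)
n, mean_d, sd_d = 20, 1.70, 12.39

sem = sd_d / math.sqrt(n)  # standard error of the mean difference
t = mean_d / sem           # one-sample t statistic

# 95% CI using the t critical value for n - 1 = 19 degrees of freedom
t_crit = 2.093
lower = mean_d - t_crit * sem
upper = mean_d + t_crit * sem

print(round(sem, 2), round(t, 2))        # 2.77 0.61
print(round(lower, 2), round(upper, 2))  # -4.1 7.5
```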
We now consider another example, this time to compare blood pressure (BP). This is tackled in the same way as above but we can explore how the results and our interpretation differ when we find a statistically significant difference as you will see.
In Fig 5, we can see that there seem to be more dots above the line of equality than below it, suggesting that in most cases the mean BP at timepoint 2 is higher than that at timepoint 1. We can also see one observation where the mean BP at timepoint 2 is recorded at 150, which is substantially higher than any other value.
From Fig 6, we can see that the differences are predominantly positive, ranging from 0 to approximately 28 and that there is one observation at over 70 (which is rather odd being much larger than the next largest difference).
Table 7 summarises the differences and also reports the 95% confidence interval for the difference in mean BP as 0·18 to 19·92, with the P value reported at 0·046.
Table 7. Summary descriptive statistics and confidence interval for difference in mean BP
|Descriptive statistics: difference in mean BP|
|Difference in mean BP||n||n*||Mean||SEM||SD||Minimum||Q1||Median||Q3||Maximum|
|One-sample t test: difference in mean BP|
|Test of mu=0 versus not=0|
|Difference in mean BP||n||Mean||SD||SEM||95% CI||T||P|
| ||20||10·05||21·09||4·72||(0·18, 19·92)||2·13||0·046|
How would we report the findings? Well first, as the 95% confidence interval for the difference does not include 0, we can conclude that there is a statistically significant difference in mean BP between timepoint 1 and timepoint 2, with the mean BP at timepoint 2 plausibly (with 95% confidence) somewhere between 0·18 and 19·92 mmHg higher than at timepoint 1. Additionally (although not necessary), we could also say that the P value is less than 0·05, so at the 5% level, we can reject the null hypothesis that the mean BP is the same at timepoints 1 and 2 in favour of the alternative hypothesis that the mean BP at the two timepoints is different.
A small warning note: the confidence interval just excludes 0, the P value is just less than 0·05 and we have a worryingly large observed difference at 70. How much of an influence has this observation had on the result? Perhaps more than we might imagine. If we remove this unusual observation and repeat the last step in the analysis as reported in Table 8, suddenly the confidence interval includes 0 and the P value (while still small) is greater than 0·05. So now we would conclude that there is no statistically significant difference.
Table 8. Confidence interval for difference in mean BP
|One-sample t test: difference in mean BP (excluding one data value)|
|Test of mu=0 versus not=0|
|Difference in mean BP||n||Mean||SD||SEM||95% CI||T||P|
| ||19||6·79||15·66||3·59||(–0·76, 14·34)||1·89||0·075|
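As a check, the results in Table 8 can likewise be recomputed from the summary statistics. In this Python sketch, 2·101 is the t critical value for 18 degrees of freedom:

```python
import math

# Summary statistics from Table 8 (difference in mean BP, n = 19
# after excluding the unusually large observation)
n, mean_d, sd_d = 19, 6.79, 15.66

sem = sd_d / math.sqrt(n)  # standard error of the mean difference
t = mean_d / sem           # one-sample t statistic

# 95% CI using the t critical value for n - 1 = 18 degrees of freedom
t_crit = 2.101
lower = mean_d - t_crit * sem
upper = mean_d + t_crit * sem

print(round(sem, 2), round(t, 2))        # 3.59 1.89
print(round(lower, 2), round(upper, 2))  # -0.76 14.34
```

Removing a single observation has moved the lower confidence limit from just above 0 to just below it, which is exactly the sensitivity being highlighted here.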
Of course, we are not recommending that this observation should be removed, we simply wanted to highlight that single observations do matter and we should remain alert to the possibility that our findings depend critically on (or are sensitive to) a small number of potentially unusual or influential observations.