Statistics: more than pictures



This, the third of our series of articles on statistics in veterinary medicine, moves on to the more complex concepts of hypothesis testing and confidence intervals. As these two areas are widely discussed in many clinical research publications, an awareness of the methodology underlying their use is essential to appreciate the information they convey.


In the previous article (Scott and others 2011), we focussed on exploration and the more subjective – but still important – aspects of visualising and summarising data, using material from a study by Bell and others (2011) (Table 1). In this article, we will highlight some more formal statistical methods, namely, hypothesis tests and confidence intervals, and introduce them for two simple experimental situations. Underpinning these methods is the idea of a probability model, and, in the last article, we introduced the normal or Gaussian distribution and some of its properties, which will serve for our model. We also showed an example of a probability plot to verify the particular distributional assumption we make. To highlight the clinical use of some of these statistical methods, we will introduce a further study in this article, an investigation of the anaesthetic sparing effect of brachial plexus block (BPB) in cats (Mosing and others 2010; Table 1). Although various quantities were measured in their experiment, we will consider only the cardiorespiratory variables (heart rate, respiratory rate and non-invasive arterial blood pressure), and, for simplicity, we will look at only one timepoint, although measurements were made at five timepoints (analysing all five timepoints will be the topic of a later article).

Table 1. Basic details of the two studies being discussed in this article
Bell and others (2011) performed a blinded investigation assessing the sedative, cardiorespiratory and propofol-sparing effects in dogs of two doses of dexmedetomidine in combination with buprenorphine, compared with acepromazine in combination with buprenorphine. Sixty dogs (20 in each group) were recruited to the study, although 1 dog was subsequently excluded. Heart rate, arterial blood pressure, respiratory rate and quality of sedation were recorded by the authors, as well as propofol dose requirements.
Mosing and others (2010) performed a blinded study evaluating the isoflurane sparing effect and the postsurgical analgesia provided by a brachial plexus nerve block in cats undergoing distal thoracic limb surgery. Twenty cats were recruited to the study, with 10 undergoing conventional anaesthesia and another 10 undergoing anaesthesia combined with a BPB. Two cats were subsequently excluded from the study (one from each group). The investigators recorded a number of cardiovascular and respiratory variables, in addition to isoflurane requirements and postoperative pain scores.

Hypothesis testing and construction of confidence intervals are part of the process of statistical inference, and Fig 1 tries to explain what inference is about.

Figure 1.

Diagram illustrating the concept of statistical inference

Statistical inference is the process whereby, on the basis of the experiment we performed, the data observed and a statistical model, we draw conclusions and make statements about the real world, not restricted to the experiment we carried out.

Before proceeding to introduce hypothesis testing and confidence intervals, it is necessary to define a basic vocabulary that is commonly used.

A population is the entire group of animals about which we wish to draw conclusions, while a sample is the subset of that population actually studied. A parameter is a numerical summary, such as a mean, describing the population; the corresponding quantity calculated from the sample is called a statistic. A parameter is generally unknown (usually because it is not possible to measure the variable on the entire population); the corresponding statistic and the variability of that statistic are then used as a tool to make inferences about the unknown parameter. This means we make a statement about a specific sample and then try to generalise it to the entire population.

We should be particularly concerned about how the sample is to be drawn from the population and how representative of that population the sample we have collected actually is. We have seen in the previous article (Scott and others 2011) how data from the paper of Bell and others (2011) demonstrated that heart rate varied amongst study subjects even before they were assigned to any treatment, so that handling variability is an important aspect of any statistical analysis. If we repeated the experiment by Bell and others (2011), and so identified a further sample of 60 subjects, they too would vary in terms of heart rate; this is sometimes referred to as sampling variability.

The objectives of our designed experiment are concerned not with the properties of just the individuals that we have been working with, but with the properties of the wider population. In specifying the objectives, we might use statements such as “is the effect of drug A on blood pressure significant?”, “is there a difference between the mean heart rate of dogs on treatment A compared with treatment B?” and so on. The analysis of the results from suitably designed experiments will be used to answer such questions, but we only have a sample from the population and we cannot, therefore, be 100% certain about the true value of the population attribute (e.g. arterial blood pressure). The results of our experiment are subject to uncertainty, and in seeking to answer the experimental question, we need to take account of both sampling variability and also any intrinsic uncertainty (e.g. if the equipment used was only calibrated to measure arterial blood pressure to within the nearest 10 mmHg, there will be an inherent inaccuracy in the answer we obtain). There is a variety of different hypothesis tests and confidence intervals depending on the experimental design and the questions of interest, but all have certain principles in common. This article will first describe those principles and then take two simple examples to explain those principles in action.

The principles of hypothesis tests and confidence intervals

Hypothesis tests

The hypothesis testing approach requires that we formulate two competing hypotheses. These are called the null hypothesis (nothing of interest happens) and the alternative hypothesis (what we really expect to happen). The alternative hypothesis may be thought of as being the study hypothesis, whilst the null hypothesis is its opposite. These hypotheses concern one or more population parameters which the experiment is designed to examine. So carrying out a hypothesis test is about making a choice between two possible descriptions of the world. Usually, we begin by defining the study (alternative) hypothesis and take the null as its converse. So, for the experiment by Mosing and others (2010), the study hypothesis would be that BPB reduces isoflurane requirements and postoperative pain in cats following distal thoracic limb surgery, while the null hypothesis would be that BPB does not alter the anaesthetic requirements or the degree of postoperative pain. In some ways, the null hypothesis is “dull”; it is the scientific status quo, that is, “no effect”. For simplicity, we will focus on what are known as two-sided hypothesis tests, where the null specifies that the population parameter is equal to a specified constant or that two parameters are the same, and the alternative hypothesis says simply that the parameter is not equal to the specified constant or that the two parameters are not equal. In some situations (as in the Mosing and others study), there is information which would suggest the direction of the difference (e.g. pain with BPB is less than pain without BPB), leading to what are called one-sided tests, but we will not consider those here.

Next, we define a test statistic which summarises the evidence from our experiment. This will be a numerical value based on the results of our experiment. The actual form of the test statistic will vary from one hypothesis test to another.

Finally, we need to define a decision rule (usually formulated to include a rejection region). The decision rule says simply that if the observed value of the test statistic falls within the rejection region, then we should reject the null in favour of the alternative, otherwise we do not reject the null.

For many of the common situations, we evaluate the test statistic and reject the null if the test statistic is larger than a critical value (read from statistical tables), or more typically, by looking at the P value generated by the computer software we are using (we would reject the null hypothesis for small P values).

This means that in reality with modern statistical software, most veterinarians can ignore the actual value of the test statistic and the rejection region and simply focus on the P value.

So what is a P value?

Before answering this question, we need to introduce some further ideas. There are two types of mistake we could make when we reach a decision in the testing framework: rejecting the null hypothesis when we should not (sometimes called a type 1 error), i.e. concluding there is a genuine difference or effect when there is none, and not rejecting the null hypothesis when we should (sometimes called a type 2 error), i.e. concluding there is no difference or effect when there actually is. In an ideal world, we would not make any mistakes, but in reality, the best we can do is attempt to control the chance of making such errors while – at the same time – recognising that we cannot reduce the chance to zero. In statistical jargon, the probability of making a type 1 error that we are prepared to accept is called the significance level of the test (often denoted α); the P value calculated from the data is then compared against this level.
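For readers comfortable with a little code, the long-run behaviour of the type 1 error can be illustrated by simulation. The sketch below (in Python, using illustrative means and standard deviations, not data from either study) repeatedly compares two samples drawn from the same population; by construction the null hypothesis is true, so roughly 5% of tests should still reject it at the 0·05 level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulate many two-group "experiments" in which the null hypothesis is TRUE:
# both groups are drawn from the same normal population.
n_experiments = 2000
false_positives = 0
for _ in range(n_experiments):
    group_a = rng.normal(loc=130, scale=18, size=9)
    group_b = rng.normal(loc=130, scale=18, size=9)
    _, p = stats.ttest_ind(group_a, group_b)
    if p < 0.05:
        false_positives += 1  # rejected the null even though it is true

# The proportion of false positives should be close to the 0.05 level
print(false_positives / n_experiments)
```

Running this shows a rejection rate close to 5%: exactly the type 1 error rate we chose to accept.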

You may often see in scientific papers, notation such as:

*P<0·05; **P<0·01; ***P<0·001

In scientific research, P<0·05 is conventionally taken as an acceptable risk of a type 1 error, but the appropriate significance threshold depends on the study in question. If you have only your reputation to lose, P<0·05 is probably reasonable, but if animals’ lives are at stake, you might want a rather smaller value. In general, it is best to state the calculated P value and your conclusions and then let your readers decide if they agree with your assessment.

The P value of the test is in fact the probability of obtaining a value for the test statistic as extreme or more extreme than the actual observed value when the null hypothesis is considered true. So it is a measure of the possibility that the result you have observed may have occurred by chance; the smaller the P value, the less likely that any effect or difference is purely a “fluke.”

The principles of confidence intervals

A confidence interval is a range of credible values for a population parameter, or a difference between two population parameters. It is based on an estimate of the population parameter of interest (calculated from the sample of results obtained in our experiment) and its estimated standard error (dependent on how variable the individuals in the sample are). The estimated standard error is a statistic calculated from the sample standard deviation but is not the same as the sample standard deviation. While being related to the sampling variability, it also captures the precision with which we are able to estimate the population mean (so it is sometimes called the standard error of the mean) or the difference between two population means. The most commonly used method of interval estimation is to produce 95% confidence intervals. The justification for such intervals involves a probabilistic argument, ensuring that in the long run, 95% of such intervals will contain the true but unknown parameter value.

A common approximate form of a 95% confidence interval is

estimate ± 2 × estimated standard error

Note that this form is appropriate for examples where we are interested in the population mean; it is approximate but works well for situations where we have sample sizes of 10 or more. We will discuss in more detail when we carry out some calculations how the estimated standard error is arrived at.
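For readers who like to see the arithmetic, here is a minimal sketch of the approximate interval, using a small set of made-up heart rate values (not data from either study):

```python
import math

# Hypothetical sample of heart rates (beats/min); illustrative values only
sample = [120, 135, 128, 142, 118, 131, 126, 138, 124, 133]

n = len(sample)
mean = sum(sample) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))  # sample SD
sem = sd / math.sqrt(n)  # estimated standard error of the mean

# Approximate 95% confidence interval: estimate +/- 2 x standard error
lower, upper = mean - 2 * sem, mean + 2 * sem
print(f"mean = {mean:.1f}, 95% CI approximately ({lower:.1f}, {upper:.1f})")
```

The multiplier 2 is an approximation to the exact value from the t distribution, which is why this form works well only for samples of around 10 or more.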

Two-sided hypothesis tests and confidence intervals are based on the same statistical and probabilistic basis.

Figure 2 shows 100 confidence intervals, each one created by repeating a hypothetical “experiment.” Imagine that we take random samples of 15 healthy adult Labradors and measure their weight. Each time we take a sample, the exact individuals chosen vary, and so therefore do their mean weight and the variability of the sample. We have done this using a computer to generate the results, which is much quicker and easier than weighing Labradors. In the computer, we repeatedly generated samples of 15 values from a probability distribution with a known, “true” mean weight. From each sample, we can calculate a 95% confidence interval for the population mean weight of healthy adult Labradors. We repeated this 100 times to simulate taking 100 samples of 15 adult, healthy Labradors and calculating the 95% confidence interval for the population mean weight. We have drawn all 100 intervals side-by-side in Fig 2. So what can we see from this graph? First, the intervals represented by the vertical lines have different lengths, and they start and finish at different points. But secondly (and more interestingly), as this is a computer-generated experiment, we know what the true population mean weight really is. In the figure, this is identified by the horizontal line at 25 kg. This horizontal line does not pass through every interval; indeed, it misses four intervals, so 4% (4 of 100) of the intervals do not contain the true value, while the remaining 100–4=96% do. A 95% confidence interval should contain the true value 95% of the time, and our 96% is in line with that. This computer simulation of an experiment illustrates that if we were to repeat the same experiment many times (whereas, in reality, we usually only do it once), in the long run, 95% of our confidence intervals will contain the true but unknown parameter value.

Figure 2.

Graph of confidence intervals generated from a hypothetical normal population with mean weight 25 kg
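The Labrador simulation described above is easy to reproduce. The sketch below assumes a population standard deviation of 3 kg (the article does not state the value used) and applies the exact t-based 95% interval to each sample of 15:

```python
import numpy as np

rng = np.random.default_rng(7)

true_mean, true_sd = 25.0, 3.0   # "true" mean weight of Labradors (kg); the SD is an assumption
n_dogs, n_intervals = 15, 100
t_crit = 2.145                   # two-sided 95% point of t with 14 degrees of freedom

covered = 0
for _ in range(n_intervals):
    sample = rng.normal(true_mean, true_sd, n_dogs)
    sem = sample.std(ddof=1) / np.sqrt(n_dogs)  # estimated standard error of the mean
    lower = sample.mean() - t_crit * sem
    upper = sample.mean() + t_crit * sem
    if lower <= true_mean <= upper:
        covered += 1

# Usually around 95 of the 100 intervals contain the true mean of 25 kg
print(covered)
```

Each run of the simulation (with a different seed) gives a slightly different count, but in the long run 95% of intervals cover the true value.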

The confidence coefficient is the percentage of times that the method will – in the long run – capture the true population parameter. Convention tends to use 95%, but it is possible to use 99% intervals (much more conservative and the resulting interval will be wider).

What assumptions do these tests and confidence intervals require to be valid?

Strictly speaking, both the test and confidence interval above assume that the data are drawn from normally distributed populations and in the specific example of the Mosing and others study, that the cats in the two groups are independent. Deviations from normality can however be tolerated, but if the deviation is extreme, then it might be advisable to consider one of a number of possible alternatives. These include using a type of test that does not make such a distributional assumption. A non-parametric test such as the Mann–Whitney (two sample) test does not assume normality but is generally not as powerful as one that does. In addition, there is another assumption made about the variability of the populations (measured formally by calculating their variance). It is assumed that the two population variances are approximately equal. There is quite a degree of latitude in this: as a rule of thumb, if there is less than a factor of 3 between the two sample standard deviations, then this assumption will be reasonable.
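As an illustration of the non-parametric alternative, the sketch below applies the Mann–Whitney test to two small sets of made-up heart rates (not data from either study); note that no normality assumption is needed:

```python
from scipy import stats

# Hypothetical heart rates (beats/min) for two independent groups
group_a = [118, 132, 125, 141, 120, 128, 135, 122, 130]
group_b = [112, 126, 119, 131, 115, 121, 127, 117, 124]

# Mann-Whitney (two-sample) test: compares the two groups via ranks,
# so no assumption of normally distributed populations is required
u, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(u, p)
```

The price of dropping the normality assumption is some loss of power: when the data really are normal, the t test will detect a genuine difference more readily.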

Note also that the two-sided hypothesis test and the corresponding confidence interval will lead to the same statistical inference or conclusion.


Example 1: two independent samples, comparison of the population means

A two-sample t test: Using the study of Mosing and others (2010), we first consider a specific type of hypothesis test, namely, a two-sample t test, used in this case since we have two independent groups of nine cats. Although the study authors compared a number of variables between the two groups at various timepoints in the experiment, for simplicity, we will just look at the baseline heart rates immediately before the start of surgery. At this stage, all of the cats were anaesthetised, nine had received brachial plexus nerve blocks (BPB), while the other nine had not. A t test on the data from this timepoint is used to determine if there are differences between the groups even before the surgery started, as this would have implications for subsequent differences observed in heart rates between the two groups.

Null hypothesis: There is no difference in baseline mean heart rate between cats having BPB and cats not receiving BPB.

Alternative hypothesis: There is a difference in baseline mean heart rate between cats having BPB and cats not receiving BPB.

In this case, the test statistic uses (quite naturally) the observed difference between the two sample means (133 beats/min from the cats with no block and 127 beats/min from the cats that received BPB), and their standard deviations (19 and 17, respectively).

Simply put, the question is whether a difference of 6 (133 – 127) is sufficiently different from 0 that we can say the result is unlikely to have occurred by chance if the mean heart rates in the populations of cats were really the same. We can only answer this question by taking account of the variability in the samples, because this experiment actually looked at only one sample of cats with no block and one sample of cats that received the block: just one of the many possible sets of 18 cats that we could have observed.

Table 2 shows the statistical output for a two-sample t test for our example. In the first section, it shows the summary statistics – mean, standard deviation and the standard error of the mean (SEM) [the SEM is the standard deviation divided by the square root of n, where n is the number of animals in each group (so the SEM in this case would be 19/3 and 17/3)]. The second section shows the difference between the means of the two treatments in our samples, namely, 6. It also shows the test statistic (t value), which is 0·71, and the P value for that test statistic, which is 0·49. With a P value of 0·49, the chance of observing this amount of difference when there is really no difference between the two treatments is about 50% (or 49% to be precise). We should note that the variability is quite large (SEMs of 6·3 and 5·7) in comparison to the difference in mean heart rate (6·0), and so the difference is quite likely to have occurred by chance, even if the treatment had no effect. The authors can then conclude that there is no statistically significant difference between the mean heart rates for the two treatment groups.

Table 2. Typical output from the two-sample t test
  1. SD Standard deviation, SEM Standard error of the mean

Group  n  Mean   SD    SEM
1      9  133·0  19·0  6·3     Section 1
2      9  127·0  17·0  5·7
Difference=mu (1)–mu (2)
Estimate for difference: 6·00     Section 2
t test of difference=0 (versus not=): t value=0·71 P value=0·490 DF=16
Both use pooled SD=18·0278

What about the other pieces of information: DF=16 and the fact that a pooled standard deviation was used? Are these important in our discussion? DF stands for degrees of freedom, which is a way of expressing how many independent observations there are in arriving at a calculated statistic; it is in this case 9+9–2 or total number of observations less 2 (since we are interested in the two population mean values), but this is rather a technical point and one which we can ignore. The fact that the test has used the pooled standard deviation is more important, as it relates to an assumption that has been made for this particular version of the two-sample t test. The two-sample t test makes the assumption that heart rate is normally distributed; the normal distribution is specified in terms of the mean or average value and variance or variability of the individuals, and for this case, we are making the assumption that the variance or variability is the same in the two treatment groups. This is then estimated by calculating a pooled variance, and the form of pooling is by a weighted average of the two separate sample variances. Too much detail? Perhaps, but it is always good to be aware of the assumptions made in your statistical analysis.
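The whole calculation in Table 2 can be reproduced from the summary statistics alone. The sketch below uses SciPy's ttest_ind_from_stats; the means, standard deviations and group sizes come from the study, and the pooled standard deviation is the weighted average of the two sample variances described above:

```python
import math
from scipy import stats

# Summary statistics for baseline heart rate (beats/min)
mean1, sd1, n1 = 133.0, 19.0, 9  # cats with no block
mean2, sd2, n2 = 127.0, 17.0, 9  # cats that received BPB

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
pooled_sd = math.sqrt(pooled_var)
print(round(pooled_sd, 4))  # 18.0278, as in Table 2

# Two-sample t test assuming equal variances (hence the pooled SD)
t, p = stats.ttest_ind_from_stats(mean1, sd1, n1, mean2, sd2, n2,
                                  equal_var=True)
print(round(t, 2), round(p, 3))  # about 0.71 and 0.49, matching Table 2
```

Note the equal_var=True argument: this is precisely the equal-variance assumption discussed above, and dropping it gives a variant of the test (Welch's) that does not pool the standard deviations.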

A confidence interval for the difference between two population means: Continuing our analysis, an alternative approach would be to construct a confidence interval for the difference between the two population means. We will use the same information as we did for the t test, but in a slightly different way.

Remember that we explained that a common approximate form of a 95% confidence interval is

estimate ± 2 × estimated standard error

The parameter of interest here is the difference in heart rates between the two treatments, so our estimate of the population mean difference in heart rate is 6 and next, we need its estimated standard error. There are two methods of approach to the estimated standard error, the most common one being to assume that the variability in the two groups is the same (as we did in the hypothesis test above where we used the pooled standard deviation), as only in this way are the probability assumptions satisfied.

What about the estimated standard error of that difference? This is calculated using the standard error of the mean for each group. Numerically, this would result in the estimated standard error of the difference being 8·498. Tables 3 and 4 explain this in more detail.

Table 3. Calculation of the estimated standard error of the difference between two population means
  1. First we must use the pooled standard deviation which, from Table 2, is 18·028. Then, we calculate the SEM for each group, so for group 1, 18·028/3=6·009 and for group 2 also 6·009. Squaring and then adding these two figures together gives 72·216; taking the square root of this gives the estimated standard error of the difference, 8·498

Mean difference=133–127=6·0
Standard error of the difference=8·498

Table 4. Calculation of the confidence interval for the difference between two population means
  1. 6±2×8·498 or 6±16·996, which gives an interval from 6–16·996 to 6+16·996, or –11 to 23
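The calculations in Tables 3 and 4 can be checked with a few lines of code; this sketch simply retraces the arithmetic described above:

```python
import math

pooled_sd = 18.0278  # from the two-sample t test output (Table 2)
n1 = n2 = 9

# Standard error of the difference between the two sample means
se_diff = pooled_sd * math.sqrt(1 / n1 + 1 / n2)
print(round(se_diff, 3))  # 8.498

# Approximate 95% CI: difference +/- 2 x standard error of the difference
diff = 133.0 - 127.0
lower, upper = diff - 2 * se_diff, diff + 2 * se_diff
print(round(lower), round(upper))  # -11 and 23
```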

So the difference in population mean heart rate could plausibly lie somewhere between –11 and 23. The important thing is that this interval includes the value 0, so the two population mean heart rates (for cats with BPB and cats without BPB) are plausibly the same.

The conclusions from both the hypothesis test and the confidence interval are the same, but if given a choice, we would recommend that you use confidence intervals. Why? Suppose you are in a hypothetical situation similar to the one above, but the hypothesis test leads you to reject the null hypothesis, so your conclusion would be that there is a statistically significant difference in the population mean heart rate. Would the next obvious question not be, how much of a difference, or is treatment 1 mean heart rate higher or lower than that of treatment 2? To answer these follow-on questions, we need a confidence interval, so why not use it immediately?

Let us extend this section to consider another common type of question.

Example 2: paired data

The next experimental situation we could consider is that of paired data. In the first instance, consider a situation where the same individual has the same attribute (e.g. heart rate) measured on two separate occasions (e.g. before and after a treatment). Here, the questions of interest typically concern the change, if any, which has occurred as a result of the treatment. In this case, we assume that the paired observations are related, so we cannot treat the problem as though we had two independent samples as we did when we considered the two individual groups in the study by Mosing and others (2010).

The question of interest concerns evidence for a difference in the population mean values before and after. The simple and elegant solution is to calculate the difference between the two measurements of heart rate for the same individual and analyse those differences as a single sample. This leads to a one-sample t test.
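The equivalence between a paired t test and a one-sample t test on the individual differences can be demonstrated directly. The sketch below uses five pairs of made-up heart rates (not data from the study):

```python
from scipy import stats

# Hypothetical heart rates (beats/min) for five dogs, before and after treatment
before = [110, 96, 124, 88, 102]
after = [98, 92, 110, 90, 95]

# A paired t test on the before/after measurements...
t_paired, p_paired = stats.ttest_rel(before, after)

# ...is exactly a one-sample t test of the individual differences against 0
diffs = [b - a for b, a in zip(before, after)]
t_one, p_one = stats.ttest_1samp(diffs, 0)

print(round(t_paired, 2), round(p_paired, 3))
```

Both calls return identical results, which is why analysing the differences as a single sample is all that is required.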

So let us return to the study by Bell and others (2011) and consider one group of dogs and the measurement of heart rate at two different timepoints in the experiment (Fig 3).

Figure 3.

Scatterplot of heart rate at timepoints 1 and 2 with the line of equality

How has heart rate changed from timepoint 1 to timepoint 2?

Each dot represents an individual dog, and we can compare heart rate measured at the two timepoints on the same dog (only 18 dots are visible as there are three dogs with a heart rate of 80/min before and after treatment, and these dots are superimposed on the graph). The line represents the line of equality. Points that lie on this line would represent dogs for which the heart rates at the two timepoints were the same. This type of scatterplot, where the scales on the horizontal and vertical axes are the same and the line of equality has been included, is very good for paired data experiments.

Has heart rate changed between the two timepoints? In Fig 3, it looks as though there are more points above the line than below, so that would suggest that heart rate at timepoint 1 was higher than that at timepoint 2 for the majority of dogs, which might suggest that heart rate has decreased. Let us calculate the differences for each individual dog and see what they show (Fig 4).

Figure 4.

Dotplot of difference in heart rate between the two timepoints

Table 5 and Fig 4 summarise the differences. But we need to know how the difference was calculated before we can discuss its interpretation. In this case, the difference was calculated as timepoint 1 heart rate minus timepoint 2 heart rate. The mean difference is 1·70, so on average, timepoint 1 heart rates were 1·7 beats/min higher (that is quite small), and more importantly, the standard error was 2·77 (which is greater than the mean difference). If the variability is relatively large, different samples taken from the population will have quite different means. This is not a problem if the difference between treatments is large, but if the difference is small, there is a high risk that a single sample from each treatment will produce a difference that is not representative of the real difference produced by the treatment. So, informally, there is no evidence of a change in mean heart rate. However, as statisticians, we need to investigate formally.

Table 5. Summary statistics for difference in heart rate between the two timepoints
  1. HR Heart rate, SEM Standard error of the mean, SD Standard deviation

Descriptive statistics: difference
Difference in mean HR  n  n*  Mean  SEM  SD  Minimum  Q1  Median  Q3


We can again use a hypothesis test approach or a confidence interval approach.

For the hypothesis test, the null hypothesis would stipulate that the two population means are the same, while the alternative would stipulate they are different. Table 6 presents both the test and confidence interval results.

Table 6. Summary of hypothesis test and confidence interval
  1. HR Heart rate, SEM Standard error of the mean, SD Standard deviation

Difference in mean HR  n   Mean  SD     SEM   95% CI         T     P
                       20  1·70  12·39  2·77  (–4·10, 7·50)  0·61  0·547

So first let us focus on the test: the test statistic is 0·61 and the P value is 0·547, so we would see a test statistic as large as this nearly 55% of the time purely by chance. We therefore conclude that we cannot reject the null hypothesis (because of the large P value which is far greater than the nominal 0·05 significance level that is frequently used). From this, we conclude that it is plausible that the mean heart rates are the same at the two different timepoints. The 95% confidence interval for the mean difference is –4·1 to 7·5 and as this interval includes 0, it is entirely plausible that there is no difference in the population mean heart rates.
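The figures in Table 6 can be reproduced from the summary statistics alone (n, mean and standard deviation of the differences); this sketch retraces that arithmetic, using the exact t multiplier for 19 degrees of freedom rather than the approximate value of 2:

```python
import math
from scipy import stats

# Summary of the heart rate differences (timepoint 1 minus timepoint 2)
n, mean_diff, sd = 20, 1.70, 12.39

sem = sd / math.sqrt(n)  # estimated standard error of the mean difference
print(round(sem, 2))     # 2.77

# One-sample t test of the mean difference against 0
t_stat = mean_diff / sem
p = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Exact 95% CI: mean +/- t critical value x SEM
t_crit = stats.t.ppf(0.975, df=n - 1)  # about 2.093 for 19 degrees of freedom
lower = mean_diff - t_crit * sem
upper = mean_diff + t_crit * sem
print(round(t_stat, 2), round(lower, 1), round(upper, 1), round(p, 3))
```

The printed values match Table 6: t=0·61, a 95% CI of (–4·10, 7·50) and a P value of 0·547.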

We now consider another example, this time to compare blood pressure (BP). This is tackled in the same way as above but we can explore how the results and our interpretation differ when we find a statistically significant difference as you will see.

In Fig 5, we can see that there seem to be more dots above the line of equality than below it, suggesting that in more cases, the mean BP at timepoint 2 is higher than that at timepoint 1. We can also see one observation where the mean BP at timepoint 2 is recorded at 150, which is substantially higher than any other value.

Figure 5.

Scatterplot of mean BP at timepoints 1 and 2 with the line of equality

From Fig 6, we can see that the differences are predominantly positive, ranging from 0 to approximately 28 and that there is one observation at over 70 (which is rather odd being much larger than the next largest difference).

Figure 6.

Dotplot of difference in mean BP between the two timepoints

Table 7 summarises the differences and also reports the 95% confidence interval for the difference in mean BP as 0·18 to 19·92, with the P value reported at 0·046.

Table 7. Summary descriptive statistics and confidence interval for difference in mean BP
  1. BP Blood pressure

Descriptive statistics: difference in mean BP
Difference in mean BP  n  n*  Mean  SEM  SD  Minimum  Q1  Median  Q3  Maximum

One-sample t test: difference in mean BP
Test of mu=0 versus not=0
Difference in mean BP  n   Mean   SD     SEM   95% CI         T     P
                       20  10·05  21·09  4·72  (0·18, 19·92)  2·13  0·046

How would we report the findings? Well first, as the 95% confidence interval for the difference does not include 0, we can conclude that there is a statistically significant difference in mean BP between timepoint 1 and timepoint 2, with the population mean BP at timepoint 2 plausibly somewhere between 0·18 and 19·92 mmHg higher than at timepoint 1. Additionally (although not necessary), we could also say that the P value is less than 0·05, so at the 5% level, we can reject the null hypothesis that the mean BP is the same at timepoints 1 and 2 in favour of the alternative hypothesis that the mean BP at timepoints 1 and 2 is different.

A small warning note: the confidence interval just excludes 0, the P value is just less than 0·05 and we have a worryingly large observed difference at 70. How much of an influence has this observation had on the result? Perhaps more than we might imagine. If we remove this unusual observation and repeat the last step in the analysis as reported in Table 8, suddenly the confidence interval includes 0 and the P value (while still small) is greater than 0·05. So now we would conclude that there is no statistically significant difference.

Table 8. Confidence interval for difference in mean BP
  1. BP Blood pressure

One-sample t test: difference in mean BP (excluding one data value)
Test of mu=0 versus not=0
Difference in mean BP  n   Mean  SD     SEM   95% CI          T     P
                       19  6·79  15·66  3·59  (–0·76, 14·34)  1·89  0·075

Of course, we are not recommending that this observation should be removed, we simply wanted to highlight that single observations do matter and we should remain alert to the possibility that our findings depend critically on (or are sensitive to) a small number of potentially unusual or influential observations.
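This sensitivity check can be retraced from the summary statistics in Tables 7 and 8 alone; the sketch below recomputes both confidence intervals and shows that only the one based on all 20 dogs excludes 0:

```python
import math
from scipy import stats

def ci95(n, mean, sd):
    """Exact 95% confidence interval for a mean, from summary statistics."""
    sem = sd / math.sqrt(n)
    margin = stats.t.ppf(0.975, df=n - 1) * sem
    return mean - margin, mean + margin

with_outlier = ci95(20, 10.05, 21.09)    # Table 7 summary (all 20 dogs)
without_outlier = ci95(19, 6.79, 15.66)  # Table 8 summary (outlier removed)

print(with_outlier)     # roughly (0.2, 19.9): just excludes 0
print(without_outlier)  # roughly (-0.8, 14.3): includes 0
```

One observation is enough to move the interval across 0, which is exactly why findings this close to the significance boundary deserve a second look.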

Significance versus importance

One final point: Statistical significance does not always translate into practical importance; a very small effect may be of statistical significance but may be of no practical significance whatsoever. Significant results can only be given appropriate importance by reference back to the actual problem that the study is addressing. Statistical analysis only sorts out the shades of grey; it does not tell you the answer to your problem.

Remember, the fact that a statistical test gives a significant result does not necessarily indicate that the result is important.


Simply put, hypothesis tests and confidence intervals are powerful inferential tools, taking you beyond the results of your experiment to the “bigger picture.” You need to be careful (as always), because assumptions will be made that should be checked, and pictures and diagrams have an important role in informing your analysis. There are many different hypothesis tests and confidence intervals for different experimental situations, but they are all built on the same principles as those introduced here. We have avoided using mathematical formulae, but of course, the tests and intervals have a mathematical foundation (based on the theory of estimation and probability). Interested readers who wish to see a fuller, more in-depth development should refer to Altman and Bland (2005) and Bland and Altman (1986).

Conflicts of interest

None of the authors of this article has a financial or personal relationship with other people or organisations that could inappropriately influence or bias the content of the paper.