Abstract

This short addition to our series on clinical statistics concerns relationships, and answering questions such as “are blood pressure and weight related?” In a later article, we will answer the more interesting question of how they might be related. This article follows on logically from the previous one dealing with categorical data, the major difference here being that we will consider two continuous variables, which naturally leads to the use of a Pearson correlation or occasionally a Spearman rank correlation coefficient.


Introduction

Consider two situations: in the first, we measure the same quantity (e.g. heart rate) before and after exercise on a set of individuals; in the second, we measure heart rate and blood pressure on each individual in the set. In situation 1, we might typically be interested in the change in heart rate before and after exercise, i.e. the difference in heart rate, while in situation 2, we might be interested in whether a high heart rate is associated with high blood pressure or vice versa. In this article – which follows on logically from the previous one dealing with categorical data (Scott et al. 2013) – the main focus will be on situation 2, but in the last section we will revisit situation 1, which is known as a matched pairs design.

When two different variables have been measured on the same individual, the statistical analysis of the data will typically address questions concerning the relationship or dependence between the variables. The Pearson correlation coefficient is a measure of the strength of the linear relationship between two variables. It is a summary statistic: a single number that summarises the strength of the relationship. As we will see in a later article, we might also wish to model the dependence of one variable on another, and this can be done using regression techniques.

As we have emphasised in all our articles, graphs will play an important role, since the correlation coefficient provides a summary based on an assumption that the relationship is linear. Scatterplots provide evidence of the relationship and its nature (linear or otherwise). So we will meet scatterplots and matrix plots (which let us explore the pairwise relationships from a data set with potentially many variables).

The examples that we will use to illustrate the ideas are taken from Bell et al. (2011), and from studies on trochlear notch sclerosis (TNS; D. Draffan, unpublished data) and mastitis in cows (A. Nolan, unpublished data), as well as a brief examination of results from a gene expression experiment.

So let's begin with some examples of scatterplots.

Scatterplots (Of All Varieties)

The examples below show a number of variants of the simple scatterplot.

Example 1: trochlear notch sclerosis

In this example, a sample of dogs attending the University of Glasgow Small Animal Hospital had the TNS ratio measured at the same time as a number of other demographic variables, including body weight, sex and age (TNS is a phenomenon associated with fragmented medial coronoid process). The scatterplot in Fig 1 shows TNS ratio plotted against weight: there is a great deal of scatter and no very clear relationship. The scatterplots in Figs 2 and 3 show how we can introduce a third variable, age (categorised into three groups) or sex (male or female) into our exploration of relationships, since we need to be alert to the possibility that relationships and dependencies might be different in different subgroups.

Figure 1. TNS ratio versus body weight. TNS: trochlear notch sclerosis

Figure 2. TNS ratio versus body weight in different age groups

Figure 3. TNS ratio versus body weight for both male and female dogs

None of the three figures provides clear evidence of a relationship between TNS ratio and weight – the points are randomly scattered.
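For readers who want to reproduce this kind of exploratory plot in software other than Minitab, the following is a minimal sketch in Python. The file and column names (tns_dogs.csv, weight, tns_ratio, age_group) are hypothetical stand-ins, not the actual study data.

```python
# Minimal sketch of a grouped scatterplot in the spirit of Figs 1 to 3.
# File and column names are hypothetical, not the actual study data.
import pandas as pd
import matplotlib.pyplot as plt

dogs = pd.read_csv("tns_dogs.csv")  # one row per dog: weight, tns_ratio, age_group, sex

fig, ax = plt.subplots()
for group, subset in dogs.groupby("age_group"):
    ax.scatter(subset["weight"], subset["tns_ratio"], label=group, alpha=0.7)
ax.set_xlabel("Body weight (kg)")
ax.set_ylabel("TNS ratio")
ax.legend(title="Age group")
plt.show()
```

Replacing age_group with sex in the groupby call would give the equivalent of Fig 3.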

Example 2: mastitic cows

The second example involves a study of 375 mastitic cows, where the somatic cell counts (SCC) and acute phase proteins (APP) have been measured in milk samples. At the same time, body temperature, heart rate and respiratory rate were also measured. The scatterplot in Fig 4 shows APP plotted against SCC. It demonstrates some interesting features, the most notable being that the data points are clustered very close to the zero APP value and that the axes have been stretched by a small number of very high SCC and APP values, suggesting that we should consider transforming the variables (starting with a log transformation). Using transformations is quite a common statistical technique, with the log transformation being one of the most frequently used. There are several purposes of using transformations, including (1) producing a more symmetrical distribution of values and (2) showing the data on a more natural scale, as is the case in this example.

Figure 4. Scatterplot of APP versus SCC. APP: acute phase proteins; SCC: somatic cell counts

The second plot (Fig 5), where APP and SCC have been replaced by log(APP) and log(SCC), shows more clearly the relationship that exists between the two variables. Imagine trying to draw a straight line through the mass of the points from the bottom left corner to the top right corner: the points will scatter around this line. We can conclude that as log(SCC) increases then so also does log(APP). Because the log transformation preserves order (it is monotonic increasing), this means that as SCC increases then so also does APP. Thus, although we are now working on the log scale, we can still make statements about the original variables.

Figure 5. Scatterplot of log(APP) versus log(SCC)
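As a rough illustration of the transformation step, the sketch below plots the raw and log-transformed variables side by side; the file and column names (mastitis_cows.csv, app, scc) are again hypothetical.

```python
# Sketch of the log transformation behind Figs 4 and 5; names are hypothetical.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

cows = pd.read_csv("mastitis_cows.csv")  # one row per cow: app, scc, rectal_temp, ...

cows["log_app"] = np.log(cows["app"])    # natural log; assumes all values are positive
cows["log_scc"] = np.log(cows["scc"])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(cows["scc"], cows["app"], s=10)           # raw scale: points bunched near zero
axes[0].set_xlabel("SCC")
axes[0].set_ylabel("APP")
axes[1].scatter(cows["log_scc"], cows["log_app"], s=10)   # log scale: linear trend visible
axes[1].set_xlabel("log(SCC)")
axes[1].set_ylabel("log(APP)")
plt.tight_layout()
plt.show()
```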

Continuing with the exploration of the data set on mastitic cows, the third plot (Fig 6) shows what is called a matrixplot, which is useful when we want to explore the relationships between more than two variables. There are three scatterplots: those in the first row are rectal temperature versus heart rate and rectal temperature versus respiratory rate, and the second row shows heart rate versus respiratory rate.

Figure 6. Matrixplot of rectal temperature, heart rate and respiratory rate

The scatterplots show no strong relationships between the variables. There may be some evidence of a weak relationship between respiratory rate and rectal temperature, in that increasing respiratory rate seems to be associated with a minor increase in rectal temperature.
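A scatterplot matrix of this kind can be produced in a single call; the sketch below assumes the same hypothetical file and column names as before.

```python
# Sketch of a matrixplot (scatterplot matrix) as in Fig 6; names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

cows = pd.read_csv("mastitis_cows.csv")

scatter_matrix(cows[["rectal_temp", "heart_rate", "resp_rate"]], figsize=(7, 7))
plt.show()
```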

So far, we have used the graphs to form an impression, but perhaps we need to make things more quantitative and formal, hence the use of the correlation coefficient.

The Pearson Correlation Coefficient

The Pearson correlation coefficient is a numerical summary of the strength of the linear relationship between two variables. It must lie between the values of −1 and +1. A value of 0 would suggest that there is no linear relationship between the two variables; a value of +1 indicates a perfect positive (direct) linear relationship; a value of −1 indicates a perfect negative (inverse) linear relationship. Figure 7 shows four examples, with different strengths of relationship between the RNA amounts for different genes in a gene expression experiment. Above each panel is the correlation coefficient evaluated for that data set. We can see that for a high value of the correlation coefficient (0·9), there is a very clear linear pattern. For the moderate values (0·5 and −0·56), there is also a linear pattern, but for values close to 0, no linear pattern is apparent.

Figure 7. Scatterplots with different correlation strengths
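To make the definition concrete, here is a short sketch that computes the Pearson coefficient directly from its formula on made-up data and checks it against SciPy; none of the numbers relate to the gene expression example.

```python
# Pearson correlation computed from its definition and checked against SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.8 * x + rng.normal(scale=0.5, size=50)  # made-up data with a positive linear trend

# r = sum((x - mean(x)) * (y - mean(y))) / sqrt(sum((x - mean(x))**2) * sum((y - mean(y))**2))
dx, dy = x - x.mean(), y - y.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

r_scipy, p_value = stats.pearsonr(x, y)
print(round(r, 3), round(r_scipy, 3))         # the two values agree
```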

The statistical significance of the correlation coefficient can be formally tested using a t-test, which will generate a P-value interpreted as we have seen before. The null hypothesis of this test is that the two variables are not linearly related, and the alternative is that the two variables are linearly related. Generally speaking, correlation coefficient values greater than 0·7 or less than −0·7 would be regarded as indicating a strong, and usually practically important, linear relationship. A word of warning at this point: the significance of the formal test depends on the number of observations, so for large sample sizes, small values of the correlation coefficient become statistically significant but may not be practically important.
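The test statistic behind that P-value is t = r√(n−2)/√(1−r²) on n−2 degrees of freedom. The sketch below (made-up data; the helper name corr_p_value is hypothetical) reproduces it and also illustrates the warning about sample size: the same weak relationship becomes “significant” once n is large.

```python
# Sketch of the t-test behind the correlation P-value, and the effect of sample size.
import numpy as np
from scipy import stats

def corr_p_value(x, y):
    """Two-sided P-value for H0: no linear relationship, via t = r*sqrt(n-2)/sqrt(1-r**2)."""
    n = len(x)
    r, _ = stats.pearsonr(x, y)
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    return r, 2 * stats.t.sf(abs(t), df=n - 2)

rng = np.random.default_rng(2)
for n in (20, 500):                       # same weak relationship, two sample sizes
    x = rng.normal(size=n)
    y = 0.2 * x + rng.normal(size=n)      # weak, made-up linear relationship
    r, p = corr_p_value(x, y)
    print(n, round(r, 2), round(p, 4))    # with large n, even a small r is "significant"
```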

Warning

The correlation coefficient is a measure of the strength of the linear association between two variables. If the relationship is nonlinear, the coefficient can still be evaluated and may appear sensible, but it is uninformative, so beware – plot the data first.
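A quick made-up illustration of that warning: in the sketch below, y is completely determined by x, but because the relationship is a symmetric curve rather than a line, the Pearson coefficient comes out close to zero.

```python
# A perfect but nonlinear relationship can still give a Pearson coefficient near 0.
import numpy as np
from scipy import stats

x = np.linspace(-3, 3, 101)
y = x ** 2                          # y depends perfectly on x, but not linearly
r, p = stats.pearsonr(x, y)
print(round(r, 3))                  # ~0: always plot the data before trusting r
```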

Examples of correlation coefficients are given below (all calculated in Minitab 16; Minitab Inc. – downloadable from http://www.minitab.com/en-GB/products/minitab/default.aspx).

Example 1: Trochlear notch sclerosis

Pearson correlation of TNS ratio and weight=−0·044, P-value=0·738 (Fig 1).

There is a non-significant correlation between TNS ratio and weight, so we would conclude that there is no evidence of a linear relationship.

By age group (Fig 2)

The correlation between weight and TNS ratio for adults is 0·043, with a P-value of 0·856

The correlation between weight and TNS ratio for juveniles is 0·132, with a P-value of 0·549

The correlation between weight and TNS ratio for seniors is −0·395, with a P-value of 0·104

There is a non-significant correlation between TNS ratio and weight in each age group, so we would conclude that there is no evidence of a relationship. Interestingly, there is a moderate correlation (−0·395) in the senior age group, which, while not statistically significant (P-value of 0·104), may be of clinical relevance.

By sex (Fig 3)

The correlation between weight and TNS ratio for females is −0·045, with a P-value of 0·817

The correlation between weight and TNS ratio for males is −0·082, with a P-value of 0·655

There is a non-significant correlation between TNS ratio and weight in both males and females, so we would conclude that there is no evidence of a relationship.

Example 2: Mastitic cows

The correlation between APP and SCC (Fig 4) is 0·590, with a P-value of 0·000

The correlation between log(APP) and log(SCC) (Fig 5) is 0·577, with a P-value of 0·000

In this situation, the first correlation coefficient is moderate and statistically significant, but it should not be trusted: the scatterplot in Fig 4 makes it clear that there is no (linear) relationship – the value is being artificially inflated by the few points scattered in the top right corner of the plot.

The correlation coefficient for the transformed data is statistically significant and, together with the scatterplot (Fig 5), makes it clear that there is a linear relationship between log(APP) and log(SCC). It is an arithmetic “fluke” that the two values of the Pearson correlation (0·590 for the original data and 0·577 for the log-transformed data) are numerically similar.

For the final example (Fig 6), the correlations between RECTAL TEMP, HEART RATE and RESP RATE are shown in the table below, with the corresponding P-value beneath each correlation coefficient. This is known as the correlation matrix.

              RECTAL TEMP    HEART RATE
HEART RATE    0·231
              (P=0·021)
RESP RATE     0·328          0·205
              (P=0·001)      (P=0·041)

The correlation matrix above shows three correlation coefficients (0·231, 0·328 and 0·205); these are weak to moderate, but all are statistically significant (P-values of 0·021, 0·001 and 0·041, respectively, all <0·05). The linear relationships are real, but the statistical significance is driven by the large number of animals in the study (more than 300), so we would urge some caution in interpretation: the linear relationships are weak and may not be practically useful.
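In Python, the same kind of correlation matrix (coefficients only) comes straight from pandas, and pairwise P-values can be added with SciPy; the file and column names are the same hypothetical ones used earlier.

```python
# Sketch of a correlation matrix like the one above; names are hypothetical.
import pandas as pd
from scipy import stats

cows = pd.read_csv("mastitis_cows.csv")
clinical_vars = ["rectal_temp", "heart_rate", "resp_rate"]

print(cows[clinical_vars].corr())     # Pearson correlation matrix (coefficients only)

# P-values are not part of .corr(); obtain them pairwise if needed:
r, p = stats.pearsonr(cows["rectal_temp"], cows["resp_rate"])
print(round(r, 3), round(p, 3))
```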

A footnote on correlation coefficients

The Spearman rank correlation coefficient, which is the nonparametric alternative to the Pearson coefficient, gives a measure of the strength of a monotonic relationship between two variables. A monotonic relationship is one which is not necessarily linear but in which one variable consistently increases (or consistently decreases) as the other increases. The Spearman coefficient is calculated from the ranks of the data, not their actual values; it is interpreted in a similar way and, indeed, if the scatterplot shows a linear relationship, then there will only be a small numerical difference between the Pearson and Spearman correlation coefficients.
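The sketch below contrasts the two coefficients on a made-up monotonic but nonlinear relationship, where the Spearman coefficient is the more faithful summary.

```python
# Pearson versus Spearman on a monotonic but nonlinear (made-up) relationship.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(0.5, 5.0, size=100)
y = np.exp(x) + rng.normal(scale=5.0, size=100)  # increases with x, but far from linear

print(stats.pearsonr(x, y))    # understates the strength of the association
print(stats.spearmanr(x, y))   # rank-based, so close to +1 here
```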

Matched pairs

The final example we return to seems – in one way – very similar to all the previous examples, but is in fact an illustration of situation 1 mentioned previously – the matched pairs design, where we measure the same quantity (in this case heart rate) on two occasions on the same individual (Bell et al. 2011). Figure 8 shows heart rate measured on the same individual at timepoint 2 plotted against heart rate at timepoint 3. This shows quite clearly that there is a relationship between the two heart rate values, but this is a rather “poor” type of plot for this particular context (explained further below) so we need to consider what is different in this example, and how that difference should be reflected in the scatterplot and our analysis.

Figure 8. Scatterplot of heart rate at timepoints 2 and 3

The main difference is that there is only one variable of interest, heart rate – we would naturally assume that heart rates on the same individual will be related, hence the plot is not surprising in showing a relationship, and we could improve it.

What motivates the need for improvement is that the natural scientific question here is not “is heart rate at timepoint 2 associated with heart rate at timepoint 3?” but more sensibly “how does heart rate change from timepoint 2 to timepoint 3?”


There are two ways that we could improve this plot: first, make the scales the same on the x and y axes; and second, place the line of equality (the line y=x) on the plot to aid interpretation. Figure 9 shows the improved plot, and allows us to see that there is a strong relationship which is linear, but – equally importantly – we can also see that more points lie above the line than below, and this tells us that heart rate at timepoint 2 tends to be higher than heart rate at timepoint 3. This is another form of relationship (change from timepoint 2 to timepoint 3), and this particular design is an example of a matched pairs design. Further analysis of this example on heart rate would focus on the difference in heart rate for each individual dog.
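A sketch of how such an improved plot might be built (equal axis scales plus the line of equality) is given below; the file and column names (heart_rate.csv, hr_time2, hr_time3) are hypothetical.

```python
# Sketch of Fig 9: matched-pairs scatterplot with equal scales and the line of equality.
# File and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

dogs = pd.read_csv("heart_rate.csv")   # one row per dog: hr_time2, hr_time3

fig, ax = plt.subplots()
ax.scatter(dogs["hr_time3"], dogs["hr_time2"])

lo = min(ax.get_xlim()[0], ax.get_ylim()[0])
hi = max(ax.get_xlim()[1], ax.get_ylim()[1])
ax.plot([lo, hi], [lo, hi], linestyle="--")   # line of equality, y = x
ax.set_xlim(lo, hi)
ax.set_ylim(lo, hi)                           # same scale on both axes
ax.set_aspect("equal")
ax.set_xlabel("Heart rate at timepoint 3")
ax.set_ylabel("Heart rate at timepoint 2")
plt.show()
```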

Figure 9. Scatterplot of heart rate at timepoints 2 and 3 including the line of equality

We can also calculate the correlation coefficient, and this gives a value of 0·776, with a P-value of 0·00. There is therefore a statistically significant correlation between heart rate at times 2 and 3, which is hardly surprising because the heart rates are from the same individual.

Since the question is more naturally about change, we can use the tools we have already met: a dotplot (Fig 10) of the differences, along with some simple summary statistics, would be informative about the change.

Figure 10. Dotplot of differences between heart rate at timepoints 2 and 3

Descriptive statistics: heart rate difference

Variable   N    N*   Mean   SE Mean   SD      Minimum   Q1      Median   Q3      Maximum
Rate       59   0    4·56   2·24      17·20   −58·00    −4·00   3·00     16·00   65·00

The mean change in heart rate is 4·56 beats, and if we carry out a t-test with the null hypothesis that the mean change is zero, the P-value is 0·046, with a 95% confidence interval for the mean change of 0·08 to 9·04 beats ([4·56−(2×2·24)] and [4·56+(2×2·24)]; 2·24 being the SE of the mean). The fact that the confidence interval is entirely positive tells us that it is highly likely that the mean heart rate at timepoint 2 is higher than the mean heart rate at timepoint 3, by between 0·08 and 9·04 beats.
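The same paired analysis can be sketched as follows (hypothetical file and column names again); it reproduces the summary statistics, the one-sample t-test of zero mean change, and an approximate 95% confidence interval of the kind quoted above.

```python
# Sketch of the matched-pairs analysis: summarise the differences, test mean change = 0.
# File and column names are hypothetical.
import numpy as np
import pandas as pd
from scipy import stats

dogs = pd.read_csv("heart_rate.csv")
diff = dogs["hr_time2"] - dogs["hr_time3"]    # within-dog change in heart rate

print(diff.describe())                        # n, mean, sd, quartiles, etc.

t_stat, p_value = stats.ttest_1samp(diff, popmean=0)   # H0: mean change is zero
n = len(diff)
se = diff.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)         # exact multiplier (about 2 for moderate n)
ci = (diff.mean() - t_crit * se, diff.mean() + t_crit * se)
print(p_value, ci)                            # 95% confidence interval for the mean change
```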

Conclusions

Correlation coefficients are interesting but limited summary statistics, which need to be backed up with scatterplots. Often the more natural question is not simply how strong the relationship is and whether it is linear, but whether we can describe the relationship statistically so that it can be used, e.g. for prediction. The next article will address the topic of building a regression (or linear) model, where we have a natural ordering of the two variables, in that one variable is considered to be the response and the other the explanatory variable.

Conflict of interest

None of the authors of this article has a financial or personal relationship with other people or organisations that could inappropriately influence or bias the content of the paper.

References
