# Statistics: using regression models

## Abstract

In a previous article, we asked the simple question “Are we related?” and used scatterplots and correlation coefficients to provide an answer. In this article, we will take this question and reword it to “How are we related?” and will demonstrate the statistical techniques required to reach a conclusion.

## INTRODUCTION

Previously (Scott et al. 2013), we investigated whether two variables might be related; now, we actually want to describe the relationship, to draw inferences about the nature of the relationship and to make predictions of one variable given the value of the other. In this setting, however, the two variables of interest are no longer treated equally: in our discussion of how we can answer this question, we will need to identify the response variable, which depends on an explanatory variable or covariate. We are first going to deal with the simplest case consisting of one response variable and one explanatory variable (simple linear regression), before we deal with one response variable and many explanatory variables (multiple regression). It is worth noting that the linear and multiple regression cases are examples of linear models, and this is something that we will return to in a subsequent article.

The first example we will use concerns the weight of a horse's heart. This has been suggested to be a reasonable indicator of its fitness and strength, but is obviously difficult to measure directly. However, it might be possible to estimate the weight indirectly from ultrasound measurements. Forty-six horses destined for euthanasia had echocardiography performed, and their hearts were then weighed post-mortem (O'Callaghan 1985). One of the ultrasound measurements recorded was the thickness, in centimetres, of the outer heart wall during systole (SOW), and direct measurement was made of the post-mortem weight of the heart (in kilograms).

Is it possible to successfully use an ultrasound measurement to predict the weight of the heart? The scatterplot (Fig 1) shows heart weight measured at post-mortem (the response) plotted against the SOW ultrasound measurement (the explanatory variable) for the 46 horses.

A natural question is “Is there a relationship between heart weight and SOW?” If the answer is “yes,” a second question is “Does it appear linear?” If echocardiography is to be a useful guide to determining heart weight in horses, we will be hoping that the answer to both questions is “yes.”

We will shortly illustrate how we might answer these questions. First, however, it will be useful to set the context of this problem a little more formally. The data take the form of a series of n points denoted by $(x_i, y_i)$, where $i = 1, 2, \ldots, n$. In this case, x (the explanatory variable) would be systolic outer wall (SOW), y (the response variable) would be heart weight and n (the number of cases) would be 46. With this convention, we can then specify the model that we are going to build.

## SPECIFYING OUR MODEL

The model we intend to build is a straight line regression model. Let us remind ourselves of the equation for a straight line. This has two characteristics: a slope, or how steep the line is, and an intercept, or where the line crosses the y-axis (the vertical axis).

Mathematically, our model is defined by an equation that has parameters for each characteristic. These together describe the form of the dependence between y, the response variable, and x, the explanatory variable, and may be written as $y_i = \alpha + \beta x_i$. A second part introduces a random component, ɛ, because there is variability in the response; to put it another way, hearts that have the same value of SOW do not all have the same weight. So, we write:

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

This equation defines a straight line with slope β and intercept α. Statisticians use Greek symbols to represent the parameters. In our example, we do not know the values for α and β, and so by fitting the model, we will find estimates for them. This equation captures the nature of the data, in that the data points $(x_i, y_i)$ have a form that may be represented by a straight line defined by $y_i = \alpha + \beta x_i$, but they also show quite a bit of scatter and $\varepsilon_i$ gives a measure of that.

### Assumptions

Every statistical model makes a number of assumptions, and this model is no different. There are several assumptions that we need to be aware of and that we will need to check.

In simple terms, a basic linear regression model assumes:

• That a straight line is a good description of the relationship so that the average value of the response y is linearly related to the explanatory x,
• The spread of the response y about the average is the SAME for all values of x,
• The VARIABILITY of the response y about the average follows a NORMAL distribution for each value of x.

More formally, there are four key assumptions:

Assumption 1: the model equation α+βx fully captures the pattern of the relationship between y and x

Assumption 2: the ɛ terms have a mean value of 0

Assumption 3: the ɛ terms have constant variance $\sigma^2$

Assumption 4: ɛ terms are uncorrelated, or in other words, each observation is independent of any other.

### Fitting the model

Finding the best values for α and β (the estimates) could be done in several ways, but the most commonly used is the method of least squares estimation. In this approach, we attempt to find the numerical values for α and β such that we minimize the squared vertical “deviations” between the observed data and the fitted line. This is the vertical distance between the calculated best-fitting line and the observed points. The purpose of this procedure is to obtain the line which is the “best” summary of the individual points, by finding the line that passes as closely as possible to all of them.
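
The least squares idea described above can be sketched in a few lines of Python. This is an illustrative sketch only: the data are hypothetical (they are not the horse-heart measurements), the function names are our own, and the closed-form expressions used are the standard textbook ones for a single explanatory variable rather than anything taken from the Minitab® output.

```python
# Hypothetical illustrative data (NOT the horse-heart measurements).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def least_squares(x, y):
    """Return (intercept, slope) minimising the sum of squared vertical deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_of_squares(x, y, intercept, slope):
    """Sum of squared vertical deviations between the points and a given line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


alpha_hat, beta_hat = least_squares(x, y)
best = sum_of_squares(x, y, alpha_hat, beta_hat)
# Any other line, here one with a slightly steeper slope, does worse.
worse = sum_of_squares(x, y, alpha_hat, beta_hat + 0.1)

print(f"intercept = {alpha_hat:.2f}, slope = {beta_hat:.2f}")
print(f"sum of squares: fitted line {best:.3f}, perturbed line {worse:.3f}")
```

The comparison at the end shows what "best" means here: moving the line away from the least squares estimates increases the sum of squared deviations.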

The estimates are usually written as $\hat{\alpha}$ and $\hat{\beta}$, and our fitted model is given by the equation below:

$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$$

The “hat” (^) symbol simply indicates that this is an estimate of the value of the population parameter. As we saw in earlier papers, given the data, we produce an estimate, because we have only looked at a small sample of all the possible points, in this case horses’ hearts. The result means that we can use this fitted model to make predictions or draw inference about the relationship.

For the cardiac ultrasound example, the equation of the model, fitted using Minitab® (Minitab 16; Minitab Inc. – downloadable from http://www.minitab.com/en-GB/products/minitab/default.aspx), is shown below:

heart weight = −1·10 + 1·16 × SOW

The estimated intercept is −1·10 and the estimated slope is 1·16.

We can also plot the fitted model on the scatterplot as shown in Fig 2. The blue line shows the line y=−1·10+1·16 SOW. As we can see, this line passes through the cloud of points.

### Goodness of fit

How do we know whether this is a good model? The first thing we will explore is a couple of other summary statistics produced by the fit that can help us: the $R^2$ value and the standard deviation.

The usual measure of the quality of fit of the linear model is the coefficient of determination, $R^2$ (which, in simple linear regression, is also the square of the correlation coefficient). It is effectively the proportion of the variability in the response that can be explained by the regression on the explanatory variable. Typically, an $R^2$ value greater than 0·7 (or 70%) indicates that the linear dependence of the response upon the explanatory variable is strong.

Commonly, the $R^2$(adjusted) value is quoted. Adjusted for what? The value has been adjusted for the number of parameters in the model (without adjustment, increasing the number of parameters would always appear to improve the fit, so that comparing two models with different numbers of parameters would be biased towards the model with the larger number).

The standard deviation or “s” is also useful, because it tells us about the scatter of points about the fitted line. It is more difficult to interpret directly, but we can consider its size in relation to the range of values for y.

So, for the horses’ hearts, the adjusted $R^2$ is 59·7%, which is moderate. The s value is 0·713 kg, which – given that the weight ranges for the hearts are between approximately 1·5 and 5 kg – is also quite large. These two statistics reflect that the points in the scatterplot in Fig 2 are widely scattered around the best-fitting straight line.
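
As a check on these two summaries, the sketch below recomputes them from the sums of squares reported in the analysis of variance table later in this article (34·457 for the regression, 22·388 for the residual error and 56·845 in total, with n = 46). The formulas are the usual textbook definitions, and the variable names are ours.

```python
import math

# Values taken from the analysis of variance output quoted in this article.
n = 46                  # number of horses
n_params = 2            # intercept and slope
ss_regression = 34.457
ss_residual = 22.388
ss_total = 56.845

# Proportion of the variability in heart weight explained by SOW.
r_squared = ss_regression / ss_total

# Adjusted R-squared compares mean squares rather than sums of squares,
# so it penalises the number of fitted parameters.
adj_r_squared = 1 - (ss_residual / (n - n_params)) / (ss_total / (n - 1))

# s is the square root of the residual mean square.
s = math.sqrt(ss_residual / (n - n_params))

print(f"R-squared          = {100 * r_squared:.1f}%")
print(f"R-squared (adj)    = {100 * adj_r_squared:.1f}%")
print(f"s                  = {s:.3f} kg")
```

The unadjusted value (about 60·6%) is slightly higher than the adjusted value of 59·7%, as expected with only two parameters and 46 observations.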

But there is more we need to consider, because our question was “Is this a good model?” and to answer that, we also need to consider whether the assumptions we have made are reasonable.

### Checking assumptions

So, an important part of the modelling process is to assess the validity of the assumptions upon which the model is based.

This is commonly done by examining residuals. The residuals are the differences between the observed values of the response and the values of the response that the model would expect corresponding to that value of the explanatory variable (Fig 3). The residuals are the estimated ɛ components that we introduced earlier.

So, for example, based on our model (heart weight=−1·10+1·16 SOW), a horse whose SOW is 2·4 cm would be estimated to have a heart weight of:

−1·10 + 1·16 × 2·4 = 1·68 kg

This horse's heart actually weighed 1·43 kg, so the residual is 1·43 − 1·68 = −0·25 kg.
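
The same calculation can be repeated for any horse. The sketch below uses the unrounded coefficients from the Minitab® output quoted under Question 1 (−1·0996 and 1·1576) and three observations that appear in this article: the horse above and the two horses listed in the unusual observations output. The function name is our own.

```python
# Unrounded coefficients as quoted in the Minitab output in this article.
INTERCEPT = -1.0996
SLOPE = 1.1576


def fitted_weight(sow_cm):
    """Heart weight (kg) that the fitted line gives for a given SOW (cm)."""
    return INTERCEPT + SLOPE * sow_cm


# (SOW in cm, observed heart weight in kg) for three horses quoted in the text.
observations = [(2.4, 1.43), (3.10, 4.010), (4.80, 3.431)]

rows = []
for sow, weight in observations:
    fit = fitted_weight(sow)
    residual = weight - fit  # observed minus fitted
    rows.append((sow, weight, round(fit, 3), round(residual, 3)))

for sow, weight, fit, residual in rows:
    print(f"SOW={sow:.2f}  weight={weight:.3f}  fit={fit:.3f}  residual={residual:+.3f}")
```

A negative residual means the heart was lighter than the line suggests; a positive residual means it was heavier.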

Very often, it is standardized or studentized residuals that will be available from your statistical package (our examples use Minitab®, as detailed above). These quantities are the original residuals (the estimated ɛ) divided by their estimated standard deviation (i.e. a form of standardization which means that residuals at all points along the line can be compared). If the model assumptions are valid, the standardized residual values should be randomly scattered around 0 and most of them (approximately 95%) should lie between −2 and +2.

Some appropriate methods of assessing the model are to

• Plot the residuals against the fitted values, when patterns may become evident if the assumptions are not appropriate.
• Construct a probability plot of the residuals (as the residuals should be normally distributed, we would expect the normal probability plot to be approximately a straight line).

The first of these plots should show a RANDOM SCATTER of the points if the assumptions of linearity and constant variance are true. The second should be linear if the assumption of normality is true.

If the residual plots do not have the above properties, then this would indicate that the model assumptions are not justified or that the model structure is inadequate. Some common problems are (1) non-constant variance, (2) missing important explanatory variables and (3) inadequacy of the model.

The scatterplot in Fig 4, including the horizontal line at 0, shows roughly equal numbers of points above and below 0, and there is no obvious pattern other than a random scatter of points. (Note, some of our readers might not agree with this statement, but interpretation of residual plots is subjective!). In Fig 5, the blue line represents a straight line, and we are judging how well the points follow that straight line. Again (and quite commonly), we see that the majority of the points follow the line closely, but in the “tails,” there is more divergence. However, it should be fairly apparent that – broadly speaking – the points follow the straight line. Having had this discussion, we would suggest that neither plot shows any worrisome features, so we would conclude that the assumptions made are valid.

Having satisfied ourselves that the model is a reasonable fit, and that the assumptions are satisfied, then we can move on to the next important area: how to use the fitted model for inference.

### Inference

The main inferential questions are

1. Is the slope significantly different from zero (a question about a parameter commonly answered using a P-value from a hypothesis test)?
2. How significant overall is the model (again typically answered using a P-value from a hypothesis test, the ANOVA table)?
3. What is the mean weight of a heart when the ultrasound measurement has a specific value (answered using a confidence interval approach)?
4. What would be the weight of an individual heart given the measured SOW (asking a question about a specific individual and answering it using a prediction interval)?

The first question is whether the slope is significantly different from 0 (i.e. is β≠0). If the slope is not significantly different from zero, then this means that the explanatory variable does not have a statistically significant relationship with the response and can be removed.

The second question concerns whether the model performs better than the simple model that does not include the explanatory variable at all. In this simple case, these two questions are equivalent but, in our next paper, where we have more than one explanatory variable, these questions are addressing different issues.

The third and fourth questions concern inference about the model by constructing a 95% confidence interval (Scott et al. 2012) for the mean response for a given value of the explanatory variable and a 95% prediction interval for a future observation. The difference between these two intervals is rather subtle; the latter interval is subject to two sources of uncertainty: the uncertainty about the parameter values and secondly the scatter of the individuals in the population, captured in the variation in the points about the line, because it addresses an individual drawn from the population of all horses. The former interval asks about the population mean weight, so is not about an individual and so does not need to include this second component of uncertainty. As a result, prediction intervals that provide a range of plausible values for a future observation are always wider than the corresponding confidence interval for the population mean.
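
For readers who would like to see the two sources of uncertainty written out, the standard textbook expressions for simple linear regression are given below. They are not taken from the original analysis; here $x_0$ is the value of the explanatory variable at which we want the interval, $\bar{x}$ is the sample mean of the explanatory variable and $s$ is the residual standard deviation.

$$\mathrm{SE}_{\text{mean}}(x_0) = s\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}, \qquad \mathrm{SE}_{\text{pred}}(x_0) = \sqrt{s^2 + \mathrm{SE}_{\text{mean}}(x_0)^2}$$

The extra $s^2$ under the second square root is the scatter of individuals about the line, which is why the prediction interval is always the wider of the two.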

Now, we can consider each of these questions in turn using the output from our statistics package, Minitab®.

### Question 1

Is the slope significantly different from zero? Alternatively, we can word this as “Is the relationship between heart weight and SOW statistically significant?” To answer this question, we will use the output shown below that is extracted from the Minitab® output.

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | −1·0996 | 0·4186 | −2·63 | 0·012 |
| SOW | 1·1576 | 0·1407 | 8·23 | 0·000 |

This shows us the estimated intercept (labelled “Constant”) and slope (labelled by the explanatory variable name, “SOW”), their estimated standard errors (SE Coef) and the corresponding P-values (we will ignore the column labelled T). If we start with the slope, the estimated value is 1·1576, its estimated standard error is 0·1407 and the associated P-value is reported as 0·000, that is, less than 0·0005. Because this is well below 0·05, we can say that the slope is statistically significantly different from zero. Therefore, SOW is a useful explanatory variable in explaining the variation in weight.
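
Although the T column is not needed to reach this conclusion, it is simply each coefficient divided by its standard error. The sketch below reproduces it and also forms an approximate 95% confidence interval for the slope. The critical value 2·0154 (the 97·5% point of a t distribution with 44 degrees of freedom) is an assumption that we have typed in by hand; it does not appear in the Minitab® output.

```python
# Coefficients and standard errors as quoted in the Minitab output.
coefficients = {
    "Constant": (-1.0996, 0.4186),
    "SOW": (1.1576, 0.1407),
}

# Assumed two-sided 95% critical value of t with n - 2 = 44 degrees of freedom.
T_CRIT_44 = 2.0154

# The T statistic is the estimate divided by its standard error.
t_stats = {name: coef / se for name, (coef, se) in coefficients.items()}

# Confidence interval for the slope: estimate +/- critical value x standard error.
slope, slope_se = coefficients["SOW"]
slope_ci = (slope - T_CRIT_44 * slope_se, slope + T_CRIT_44 * slope_se)

for name, t in t_stats.items():
    print(f"{name}: T = {t:.2f}")
print(f"95% CI for slope: ({slope_ci[0]:.3f}, {slope_ci[1]:.3f})")
```

An interval for the slope that excludes zero tells the same story as the small P-value.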

We could answer a similar question concerning the intercept or constant, but this is rarely done, and often only in special circumstances.

### Question 2

How significant overall is the model? Again, we answer this using a table extracted from Minitab.

Analysis of Variance

| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 1 | 34·457 | 34·457 | 67·72 | 0·000 |
| Residual Error | 44 | 22·388 | 0·509 | | |
| Total | 45 | 56·845 | | | |

This shows the analysis of variance (ANOVA) table; we need only focus on the P-value of 0·000. Because it is <0·05, we can immediately say that our linear model including SOW is predictive of heart weight.
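
The entries in the ANOVA table are linked by simple arithmetic, which the sketch below makes explicit using the sums of squares and degrees of freedom quoted above. It also illustrates why, with a single explanatory variable, this test is equivalent to the test of the slope: the F statistic is (up to rounding of the printed values) the square of the slope's T statistic. Variable names are ours.

```python
# Sums of squares and degrees of freedom as quoted in the ANOVA table.
ss_regression, df_regression = 34.457, 1
ss_residual, df_residual = 22.388, 44

# Each mean square is a sum of squares divided by its degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# The F statistic compares the two mean squares.
f_stat = ms_regression / ms_residual

# T statistic for the slope, from the coefficient table (Coef / SE Coef).
t_slope = 1.1576 / 0.1407

print(f"MS residual = {ms_residual:.3f}")
print(f"F           = {f_stat:.2f}")
print(f"T squared   = {t_slope ** 2:.2f}")
```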

### Question 3

What is the mean weight of a heart when the ultrasound measurement has a specific value?

We could re-express this as, “Within what range does the weight of the heart lie, when the SOW is ‘2·4 cm’, for example?”

The simplest way to consider this question is to construct a confidence interval for weight or, in statistical notation, for α+2·4β. As we saw in an earlier article (Scott et al. 2012), the form of calculation for the confidence interval is estimate ± 2 × estimated standard error. The form of the standard error for this quantity is rather complex, so instead we will approach the solution graphically by constructing (in Minitab®) the confidence bands around the line. These are shown in Fig 6.

Thus, if we wanted to know the 95% CI for the population mean heart weight for horses whose SOW is 2·4, then we project vertically upwards from SOW=2·4 and read off the three values for weight where this vertical line crosses the solid fitted line and the two dashed lines; these values are shown below. So, we would conclude from the range of values that the mean weight lies within 1·43 to 1·93 kg.

Fitted Values

| New Obs | Fit | SE Fit | 95% CI |
|---|---|---|---|
| SOW=2·4 | 1·679 | 0·125 | (1·427, 1·931) |
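
The interval can also be reproduced numerically from the fitted value and its standard error. The sketch below is illustrative: it takes SE Fit to be 0·125, the value that is consistent with the reported interval and with the prediction output shown under Question 4, and it assumes a critical value of 2·0154 (the 97·5% point of t with 44 degrees of freedom), which is close to the "2" in the rule of thumb above.

```python
# Fitted value and its standard error for SOW = 2.4 cm.
fit = 1.679
se_fit = 0.125  # assumed: the value consistent with the reported interval

# Assumed two-sided 95% critical value of t with 44 degrees of freedom.
T_CRIT_44 = 2.0154

# Confidence interval for the mean heart weight: estimate +/- t x standard error.
half_width = T_CRIT_44 * se_fit
ci_lower = fit - half_width
ci_upper = fit + half_width

print(f"95% CI for mean weight: ({ci_lower:.3f}, {ci_upper:.3f}) kg")
```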

### Question 4

What would the weight of an individual heart be, given the measured SOW? Again, we are asking (apparently) a very similar question, namely: what range would we predict the heart weight to lie within for this individual? Our best guess is 1·679 kg (as before), but now we need an interval that includes the variability of weight for individual horses whose SOW is 2·4 cm. This needs a prediction interval.

If we wanted a prediction interval, then again we can use a graphical approach. Figure 7 includes the confidence bands but also the prediction bands. Thus, if we wanted to know the 95% PI for the heart weight for a specific horse whose SOW is 2·4, then again we project vertically upwards from SOW=2·4 and read off the two values for weight where this vertical line crosses the fine, outer dashed lines.

Predicted Values for New Observations

| New Obs | Fit | SE Fit | 95% CI | 95% PI |
|---|---|---|---|---|
| 2·4 | 1·679 | 0·125 | (1·427, 1·931) | (0·219, 3·138) |

The prediction interval is 0·22 to 3·14 kg, so it is very wide (certainly in relation to the overall range of heart weights for the group). This means that we are unsure about the weight of the heart for an individual horse with an SOW of 2·4 cm, both because heart weight varies greatly within the population and because the regression model including SOW fits only moderately well (i.e. weight is only partly determined by SOW). If we examine Fig 7 more closely, we will also see that, as the SOW decreases much below 2·4, the lower prediction interval line starts to include negative values for the heart weight. This is obviously biologically impossible, and is a result of the width of the prediction interval and of the fact that, in the model fitting, we did not introduce any constraint that weight is always positive.
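
To see where the extra width comes from, the sketch below rebuilds the prediction interval by adding the residual mean square (0·509, from the ANOVA table) to the squared standard error of the fit. As before, the critical value 2·0154 is our assumption, and because the inputs are rounded the result agrees with the Minitab® output only to within rounding.

```python
import math

fit = 1.679          # fitted heart weight (kg) at SOW = 2.4 cm
se_fit = 0.125       # standard error of the fitted mean
ms_residual = 0.509  # residual mean square (s squared) from the ANOVA table

# Assumed two-sided 95% critical value of t with 44 degrees of freedom.
T_CRIT_44 = 2.0154

# An individual horse adds its own scatter about the line to the
# uncertainty in the line itself.
se_pred = math.sqrt(ms_residual + se_fit ** 2)

pi_lower = fit - T_CRIT_44 * se_pred
pi_upper = fit + T_CRIT_44 * se_pred

# Width of the corresponding confidence interval, for comparison.
ci_width = 2 * T_CRIT_44 * se_fit
pi_width = pi_upper - pi_lower

print(f"95% PI: ({pi_lower:.2f}, {pi_upper:.2f}) kg")
print(f"PI width {pi_width:.2f} kg versus CI width {ci_width:.2f} kg")
```

Almost all of the width of the prediction interval comes from the scatter of individual hearts about the line, not from uncertainty in the line itself.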

## OTHER USEFUL OUTPUT

Finally, there is frequently some other very interesting information produced routinely, again relating to how good a fit the model actually is. With some further diagnostics, we can detect observations that do not seem to fit the relationship well (very large residuals, denoted R, below), or observations that have played a very influential part in determining the fitted line (denoted X, below):

Unusual Observations

| Horse no. | SOW | Weight | Fit | SE Fit | Residual | St Resid |
|---|---|---|---|---|---|---|
| 44 | 3·10 | 4·010 | 2·489 | 0·110 | 1·521 | 2·16 R |
| 46 | 4·80 | 3·431 | 4·457 | 0·290 | −1·026 | −1·57 X |

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

For the heart weight example, horse number 44 (with an SOW of 3·1) is identified as having a heart weight that is far greater than the model would estimate (actual weight 4·01 and model estimate 2·5 kg). This represents a large residual, which might suggest an unusual observation. Horse number 46 is identified as one that has had considerable influence in determining the fitted model. This is because the SOW value is 4·8, which is the largest SOW value in the dataset, and so it “pulls” the line towards it. These two values are highlighted in Fig 8, and may suggest data points that require careful checking or further investigation.
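
The standardized residuals in this output can be recovered from the other columns. The sketch below assumes the standard relationships for least squares regression, namely that the leverage of an observation equals (SE Fit / s) squared and that the standardized residual is the residual divided by s × √(1 − leverage); these formulas are not stated in the Minitab® output itself, and the function names are ours.

```python
import math

# Residual standard deviation, from the ANOVA table (SS residual / DF).
s = math.sqrt(22.388 / 44)

# horse number -> (residual, SE Fit), as quoted in the unusual observations output.
unusual = {44: (1.521, 0.110), 46: (-1.026, 0.290)}


def leverage(se_fit):
    """Leverage implied by the standard error of the fitted value (assumed relation)."""
    return (se_fit / s) ** 2


def standardized_residual(residual, se_fit):
    """Residual divided by its estimated standard deviation."""
    return residual / (s * math.sqrt(1 - leverage(se_fit)))


results = {
    horse: (leverage(se_fit), standardized_residual(residual, se_fit))
    for horse, (residual, se_fit) in unusual.items()
}

for horse, (h, st_resid) in results.items():
    print(f"horse {horse}: leverage = {h:.3f}, standardized residual = {st_resid:.2f}")
```

Horse 46 has a much larger leverage than horse 44, which matches its position at the extreme right of the SOW range, whereas horse 44 stands out only through the size of its residual.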

### Conclusions

Building models is an important skill, but we also need to consider how to assess how well the model is performing and to be alert to odd or unusual observations, all of which we have covered for the very simplest model. This short article is the first in which we have tackled this topic, and this will form the basis of a further two articles where we address more than one explanatory variable. In many regression situations, we have more than one possible explanatory variable and we must consider which variables to include and how to deal with any relationships between the explanatory variables (collinearity). Returning to the horse's heart example detailed in this article, the full set of measurements were as follows: thickness of the outer heart wall during systole (SOW) and during diastole (DOW); thickness of the inner wall during systole (SIW) and during diastole (DIW) and exterior width during systole (SEW) and during diastole (DEW). How well could we build a model using all or only some of these measurements? This is known as multiple regression, a topic that will be addressed in subsequent articles.

In addition, regression models have classically included only continuous variables, but we might naturally want to mix continuous and categorical variables, and for this we will want to consider the general linear model, which is a modern and very powerful modelling tool. For example, we might wish to include sex or breed as a further explanatory variable in our quest to explain the heart weight of horses. General linear models will also be the topic of a further article.

### Conflict of interest

None of the authors of this article has a financial or personal relationship with other people or organizations that could inappropriately influence or bias the content of the paper.