The model we intend to build is a straight line regression model. Let us remind ourselves of the equation for a straight line. This has two characteristics: a slope, or how steep the line is, and an intercept, or where the line crosses the y-axis (the vertical axis).
Mathematically, our model is defined by an equation that has a parameter for each characteristic. Together, these describe the form of the dependence between y, the response variable, and x, the explanatory variable, and may be written as: yi=α+βxi. A second part introduces a random component, ɛ, because there is variability in the response or, to put it another way, not all hearts with a particular value of SOW are the same weight. So, we write: yi=α+βxi+ɛi.
This equation defines a straight line with slope β and intercept α. Statisticians use Greek symbols to represent the parameters. In our example, we do not know the values for α and β, and so by fitting the model, we will find estimates for them. This equation captures the nature of the data, in that the data points (xi, yi) have a form that may be represented by a straight line defined by yi=α + βxi, but they also show quite a bit of scatter and ɛi gives a measure of that.
Every statistical model makes a number of assumptions, and this model is no different. In fact, there are several assumptions that we need to be aware of and that we will need to check.
In simple terms, a basic linear regression model assumes:
- That a straight line is a good description of the relationship so that the average value of the response y is linearly related to the explanatory x,
- The spread of the response y about the average is the SAME for all values of x,
- The VARIABILITY of the response y about the average follows a NORMAL distribution for each value of x.
More formally, there are four key assumptions:
Assumption 1: the model equation α+βx fully captures the pattern of the relationship between y and x
Assumption 2: the ɛ terms have a mean value of 0
Assumption 3: the ɛ terms have constant variance σ²
Assumption 4: ɛ terms are uncorrelated, or in other words, each observation is independent of any other.
Fitting the model
Finding the best values for α and β (the estimates) could be done in several ways, but the most commonly used is the method of least squares estimation. In this approach, we attempt to find the numerical values for α and β such that we minimize the squared vertical “deviations” between the observed data and the fitted line. This is the vertical distance between the calculated best-fitting line and the observed points. The purpose of this procedure is to obtain the line which is the “best” summary of the individual points, by finding the line that passes as closely as possible to all of them.
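As a minimal sketch of least squares estimation, the code below uses the standard closed-form solution for a single explanatory variable. The five data points and the function names `least_squares` and `sum_of_squares` are hypothetical, introduced only for illustration; they are not the horse data and not the output of any statistics package.

```python
# Hypothetical (x, y) pairs for illustration only; NOT the horse data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def least_squares(x, y):
    """Return (alpha_hat, beta_hat) minimizing the sum of squared vertical deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squares of x about its mean, and sum of cross-products of x and y.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx                  # estimated slope
    alpha_hat = y_bar - beta_hat * x_bar  # estimated intercept
    return alpha_hat, beta_hat


def sum_of_squares(alpha, beta, x, y):
    """Sum of squared vertical deviations between the points and the line alpha + beta * x."""
    return sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))


alpha_hat, beta_hat = least_squares(x, y)
print(f"intercept = {alpha_hat:.2f}, slope = {beta_hat:.2f}")

# Any other line gives a larger sum of squared deviations than the fitted one.
best = sum_of_squares(alpha_hat, beta_hat, x, y)
other = sum_of_squares(alpha_hat + 0.1, beta_hat, x, y)
print(best < other)
```

Comparing the sum of squared deviations for the fitted line with that for a slightly shifted line shows what "best" means here: no other straight line gives a smaller total.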
The estimates are usually written as α̂ and β̂, and our fitted model is given by the equation ŷ=α̂+β̂x.
The “hat” (^) symbol simply indicates that this is an estimate of the value of the population parameter. As we saw in earlier papers, given the data, we produce an estimate, because we have only looked at a small sample of all the possible points, in this case horses’ hearts. The result means that we can use this fitted model to make predictions or draw inference about the relationship.
The estimated intercept is −1·10 and the estimated slope is 1·16.
We can also plot the fitted model on the scatterplot as shown in Fig 2. The blue line shows the line y=−1·10+1·16 SOW. As we can see, this line passes through the cloud of points.
Figure 2. Scatterplot of heart weight (kg) versus cardiac systolic outer wall thickness (cm) in 46 horses. Blue line shows y=−1·10+1·16 SOW
Goodness of fit
How do we know whether this is a good model? The first things we will explore are a couple of other statistics or summaries produced by the fit that can help us: the R² value and the standard deviation.
The usual measure of the quality of fit of the linear model is the coefficient of determination, R² (which, in a model with one explanatory variable, is also the correlation squared). It is effectively the proportion of the variability in the response that can be explained by the regression on the explanatory variable. Typically, an R² value greater than 0·7 (or 70%) indicates that the linear dependence of the response upon the explanatory variable is strong.
Commonly, the R²(adjusted) value is quoted. Adjusted for what? The value has been adjusted for the number of parameters in the model (without adjustment, increasing the number of parameters would always appear to improve the fit, so comparing two models with different numbers of parameters would be biased towards the model with the larger number).
The standard deviation or “s” is also useful, because it tells us about the scatter of points about the fitted line. It is more difficult to interpret directly, but we can consider its size in relation to the range of values for y.
So, for the horses’ hearts, the adjusted R² is 59·7%, which is moderate. The s value is 0·713 kg, which, given that the heart weights range between approximately 1·5 and 5 kg, is also quite large. These two statistics reflect that the points in the scatterplot in Fig 2 are widely scattered around the best-fitting straight line.
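These summaries can be recovered, at least approximately, from the sums of squares reported in the analysis of variance table later in this article (residual 22·388 on 44 degrees of freedom, total 56·845 on 45). The sketch below uses the usual textbook definitions of R², adjusted R² and s; the variable names are ours, and small differences from package output would be due to rounding of the inputs.

```python
import math

# Values taken from the analysis of variance table in this article.
ss_residual = 22.388   # residual (error) sum of squares
ss_total = 56.845      # total sum of squares
df_residual = 44       # n - 2 for 46 horses and two parameters
df_total = 45          # n - 1

# Proportion of the variability in the response explained by the regression.
r_squared = 1 - ss_residual / ss_total

# Adjusted R-squared compares mean squares rather than sums of squares,
# so it penalizes each extra parameter in the model.
r_squared_adj = 1 - (ss_residual / df_residual) / (ss_total / df_total)

# s is the square root of the residual mean square.
s = math.sqrt(ss_residual / df_residual)

print(f"R-squared           = {100 * r_squared:.1f}%")
print(f"R-squared(adjusted) = {100 * r_squared_adj:.1f}%")
print(f"s                   = {s:.3f} kg")
```

The adjusted value and s agree with the 59·7% and 0·713 kg quoted above.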
But there is more we need to consider, because our question was “Is this a good model?” and to answer that, we also need to consider whether the assumptions we have made are reasonable.
So, an important part of the modelling process is to assess the validity of the assumptions upon which the model is based.
This is commonly done by examining residuals. The residuals are the differences between the observed values of the response and the values of the response that the model would expect corresponding to that value of the explanatory variable (Fig 3). The residuals are the estimated ɛ components that we introduced earlier.
Figure 3. Scatterplot of heart weight (kg) versus cardiac systolic outer wall thickness (cm) in 46 horses. Blue line shows y=−1·10+1·16 SOW. Examples of residuals are demonstrated
So, for example, based on our model (heart weight=−1·10+1·16 SOW), a horse whose SOW is 2·4 would be estimated to have a heart weight of: −1·10+1·16×2·4=1·68 kg.
This actual horse had a heart that weighed 1·43 kg, so the residual is (1·43 − 1·68) or −0·25 kg.
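The arithmetic for this single horse can be written out as a short sketch, using the rounded estimates quoted in the text (the variable names are ours):

```python
# Rounded estimates quoted in the text.
alpha_hat = -1.10   # estimated intercept
beta_hat = 1.16     # estimated slope

sow = 2.4           # systolic outer wall thickness (cm)
observed = 1.43     # observed heart weight (kg) for this horse

# Fitted value: the heart weight the model expects at this SOW.
fitted = alpha_hat + beta_hat * sow

# Residual: observed minus fitted. A negative value means the point
# lies below the fitted line.
residual = observed - fitted

print(f"fitted   = {fitted:.2f} kg")
print(f"residual = {residual:.2f} kg")
```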
Very often, it is standardized or studentized residuals that will be available from your statistical package (our examples use Minitab®, as detailed above). These quantities are the original residuals (ɛ) divided by their estimated standard deviation (i.e. a form of standardization which means that residuals at all points along the line can be compared). If the model assumptions are valid, the standardized residual values should be randomly scattered around 0 and should lie within ±2 standard deviations (approximately).
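To make the idea of standardization concrete, here is a minimal sketch under stated assumptions. The six data points are hypothetical (not the horse data), and we assume the common textbook definition in which each residual is divided by s√(1 − h), where h is the leverage of that point. Statistical packages may differ in the exact definition they use, so this should not be read as the calculation performed by any particular package.

```python
import math

# Hypothetical (x, y) values for illustration only; NOT the horse data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Least squares estimates of slope and intercept.
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
alpha_hat = y_bar - beta_hat * x_bar

# Raw residuals: observed response minus fitted response.
residuals = [yi - (alpha_hat + beta_hat * xi) for xi, yi in zip(x, y)]

# Residual standard deviation s, on n - 2 degrees of freedom.
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Leverage of each point; it is larger for x values far from the mean of x.
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Standardized residuals (assumed definition): residual / (s * sqrt(1 - leverage)).
standardized = [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]

# Flag points whose standardized residual lies outside roughly +/- 2.
flagged = [i for i, r in enumerate(standardized) if abs(r) > 2]

for xi, r in zip(x, standardized):
    print(f"x = {xi:.1f}  standardized residual = {r:+.2f}")
print("indices outside +/-2:", flagged)
```

Dividing by a point-specific standard deviation is what allows residuals at the ends of the line to be compared fairly with those in the middle.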
Some appropriate methods of assessing the model are to
- Plot the residuals against the fitted values, when patterns may become evident if the assumptions are not appropriate.
- Construct a probability plot of the residuals (as the residuals should be normally distributed, we would expect the normal probability plot to be approximately a straight line).
The first of these plots should show a RANDOM SCATTER of the points if the assumptions of linearity and constant variance are true. The second should be linear if the assumption of normality is true.
If the residual plots do not have the above properties, then this would indicate that the model assumptions are not justified or that the model structure is inadequate. Some common problems are (1) non-constant variance, (2) missing important explanatory variables and (3) inadequacy of the model.
The scatterplot in Fig 4, including the horizontal line at 0, shows roughly equal numbers of points above and below 0, and there is no obvious pattern other than a random scatter of points. (Note, some of our readers might not agree with this statement, but interpretation of residual plots is subjective!). In Fig 5, the blue line represents a straight line, and we are judging how well the points follow that straight line. Again (and quite commonly), we see that the majority of the points follow the line closely, but in the “tails,” there is more divergence. However, it should be fairly apparent that – broadly speaking – the points follow the straight line. Having had this discussion, we would suggest that neither plot shows any worrisome features, so we would conclude that the assumptions made are valid.
Figure 4. Scatterplot of standardized residuals versus fitted values, demonstrating random scattering of points and an approximately equal number of points above and below the 0 line
Having satisfied ourselves that the model is a reasonable fit, and that the assumptions are satisfied, then we can move on to the next important area: how to use the fitted model for inference.
The main inferential questions are
- Is the slope significantly different from zero (a question about a parameter commonly answered using a P-value from a hypothesis test)?
- How significant overall is the model (again typically answered using a P-value from a hypothesis test, the ANOVA table)?
- What is the mean weight of a heart when the ultrasound measurement has a specific value (answered using a confidence interval approach)?
- What would be the weight of an individual heart given the measured SOW (asking a question about a specific individual and answering it using a prediction interval)?
The first question is whether the slope is significantly different from 0 (i.e. is β≠0). If the slope is not significantly different from zero, then this means that the explanatory variable does not have a statistically significant relationship with the response and can be removed.
The second question concerns whether the model performs better than the simple model that does not include the explanatory variable at all. In this simple case, these two questions are equivalent but, in our next paper, where we have more than one explanatory variable, these questions are addressing different issues.
The third and fourth questions concern inference about the model by constructing a 95% confidence interval (Scott et al. 2012) for the mean response for a given value of the explanatory variable and a 95% prediction interval for a future observation. The difference between these two intervals is rather subtle; the latter interval is subject to two sources of uncertainty: the uncertainty about the parameter values and secondly the scatter of the individuals in the population, captured in the variation in the points about the line, because it addresses an individual drawn from the population of all horses. The former interval asks about the population mean weight, so is not about an individual and so does not need to include this second component of uncertainty. As a result, prediction intervals that provide a range of plausible values for a future observation are always wider than the corresponding confidence interval for the population mean.
Now, we can consider each of these questions in turn using the output from our statistics package, Minitab®.
Is the slope significantly different from zero? Alternatively, we can word this as “Is the relationship between heart weight and SOW statistically significant?” To answer this question, we will use the output shown below that is extracted from the Minitab® output.
This shows us the estimated intercept (labelled “Constant”) and slope (labelled by the explanatory variable name, “SOW”), their estimated standard errors (SE Coef) and the corresponding P-value (we will ignore the column labelled T). If we start with the slope, the estimated value is 1·1576, its estimated standard error is 0·1407 and the associated P-value is 0·000 (i.e. <0·001). We can say that the slope is statistically significantly different from zero. Therefore, SOW is a useful explanatory variable in explaining the variation in weight.
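The slope estimate and its standard error can also be combined into an approximate 95% confidence interval for β, which gives the same message as the P-value. The sketch below is ours rather than part of the package output. The multiplier 2·015 is an assumption taken from standard t-tables (the 97·5th percentile of the t distribution on 44 degrees of freedom); it is close to the "±2" rule of thumb.

```python
# Values quoted in the text for the slope.
coef = 1.1576      # estimated slope
se_coef = 0.1407   # estimated standard error of the slope

# Ratio of the estimate to its standard error: how many standard errors
# the estimated slope lies from zero.
ratio = coef / se_coef

# Assumption: t multiplier for 95% confidence on 44 degrees of freedom,
# taken from standard t-tables (not from the article).
t_crit = 2.015

lower = coef - t_crit * se_coef
upper = coef + t_crit * se_coef

print(f"estimate / standard error = {ratio:.2f}")
print(f"approximate 95% CI for the slope: ({lower:.3f}, {upper:.3f})")

# The interval excludes zero, consistent with the small P-value.
print(lower > 0)
```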
We could answer a similar question concerning the intercept or constant, but this is rarely done, and often only in special circumstances.
How significant overall is the model? Again, we answer this using a table extracted from Minitab.
|Analysis of Variance|
|Source||DF||SS||MS||F||P|
|Residual Error||44||22·388||0·509|| || |
|Total||45||56·845|| || || |
This shows the analysis of variance (ANOVA) table; we need only focus on the P-value of 0·000 (i.e. <0·001). Because it is <0·05, we can immediately say that our linear model including SOW is predictive of heart weight.
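For readers who want to see how the entries of an ANOVA table fit together, the sketch below derives the regression sum of squares and the F ratio from the residual and total rows given above. The derived quantities are our own arithmetic, not figures quoted from the package output, and may differ slightly from it because the inputs are rounded.

```python
# Rows quoted in the ANOVA table above.
ss_total, df_total = 56.845, 45
ss_residual, df_residual = 22.388, 44

# The total variation splits into a part explained by the regression
# and a part left over in the residuals.
ss_regression = ss_total - ss_residual
df_regression = df_total - df_residual   # 1: a single explanatory variable

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# The F ratio compares explained variation with residual variation.
f_ratio = ms_regression / ms_residual

print(f"regression SS = {ss_regression:.3f} on {df_regression} df")
print(f"residual MS   = {ms_residual:.3f}")
print(f"F ratio       = {f_ratio:.1f}")
```

A large F ratio means the model with SOW explains far more variation than would be expected by chance, which is what the small P-value summarizes.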
What is the mean weight of a heart when the ultrasound measurement has a specific value?
We could re-express this as, “Within what range does the weight of the heart lie, when the SOW is ‘2·4 cm’, for example?”
The simplest way to consider this question is to construct a confidence interval for the mean weight or, in statistical notation, for α+2·4β. As we saw in an earlier article (Scott et al. 2012), the form of calculation for the confidence interval is estimate ± 2 × estimated standard error. The form of the standard error for this quantity is rather complex, so instead we will approach the solution graphically by constructing (in Minitab®) the confidence bands around the line. These are shown in Fig 6.
Figure 6. Scatterplot of heart weight (kg) versus cardiac systolic outer wall thickness (cm) in 46 horses. The fitted line and confidence interval bands are highlighted. The example shown demonstrates the confidence intervals for a horse with a SOW of 2·4 cm
Thus, if we wanted to know the 95% CI for the population mean heart weight for horses whose SOW is 2·4, then we project vertically upwards from SOW=2·4 and read off the three values for weight where this vertical line crosses the solid fitted line and the two dashed lines; these values are shown below. So, we would conclude from the range of values that the mean weight lies within 1·43 to 1·93 kg.
|New Obs||Fit||SE Fit||95% CI|
|2·4||1·679||0·125||(1·427, 1·931)|
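The same interval can be checked numerically from the fitted value (1·679) and its standard error (SE Fit, 0·125) reported in the package output. In this sketch, the multiplier 2·015 is an assumption taken from standard t-tables for 44 degrees of freedom, used in place of the rule-of-thumb value of 2.

```python
# Values reported in the package output for SOW = 2.4 cm.
fit = 1.679      # estimated mean heart weight (kg)
se_fit = 0.125   # estimated standard error of that mean

# Assumption: t multiplier for 95% confidence on 44 degrees of freedom,
# taken from standard t-tables (not from the article).
t_crit = 2.015

# Confidence interval for the population mean: estimate +/- multiplier * SE.
half_width = t_crit * se_fit
ci_lower = fit - half_width
ci_upper = fit + half_width

print(f"95% CI for the mean weight: ({ci_lower:.3f}, {ci_upper:.3f}) kg")
```

This reproduces the interval of 1·43 to 1·93 kg read from the confidence bands.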
What would be the weight of an individual heart given the measured SOW? Again, we are asking (apparently) a very similar question, namely: what range would we predict the heart weight to lie within for this individual? Our best guess is 1·679 kg (as before), but now we need an interval that includes the variability of weight for individual horses whose SOW is 2·4 cm. This needs a prediction interval.
If we wanted a prediction interval, then again we can use a graphical approach. Figure 7 includes the confidence bands but also the prediction bands. Thus, if we wanted to know the 95% PI for the heart weight for a specific horse whose SOW is 2·4, then again we project vertically upwards from SOW=2·4 and read off the two values for weight where this vertical line crosses the fine, outer dashed lines.
|Predicted Values for New Observations|
|New Obs||Fit||SE Fit||95% CI||95% PI|
|2·4||1·679||0·125||(1·427, 1·931)||(0·219, 3·138)|
Figure 7. Scatterplot of heart weight (kg) versus cardiac systolic outer wall thickness (cm) in 46 horses. The fitted line, confidence interval and prediction interval bands are highlighted. The example shown demonstrates the prediction intervals for a horse with a SOW of 2·4 cm
The prediction interval is 0·22 to 3·14 kg, so is very wide (certainly in relation to the overall range of heart weights for the group). This means we are unsure about the weight of the heart for an individual horse with SOW of 2·4, both because of the large variability in heart weight within the population and because the fit of the regression model including SOW is only moderate, i.e. weight is only partly determined by SOW. If we examine Fig 7 more closely, we will also see that, as the SOW decreases much below 2·4, the lower prediction interval line starts to include negative values for the heart weight. This is obviously biologically impossible, and is a result of the width of the prediction interval and also because, in the model fitting, we did not introduce any constraint that weight is always positive.
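The extra width of the prediction interval can be illustrated with a short calculation. The sketch below assumes the usual textbook form, in which the scatter of individuals (s²) is added to the uncertainty in the fitted mean (SE Fit²). The multiplier 2·015 is again an assumption taken from standard t-tables for 44 degrees of freedom. Because the inputs are rounded, the result matches the package output only to within rounding.

```python
import math

# Values reported in the article for SOW = 2.4 cm.
fit = 1.679      # fitted heart weight (kg)
se_fit = 0.125   # standard error of the fitted mean
s = 0.713        # residual standard deviation (kg)

# Assumption: t multiplier for 95% coverage on 44 degrees of freedom,
# taken from standard t-tables (not from the article).
t_crit = 2.015

# Confidence interval: uncertainty about the mean only.
ci = (fit - t_crit * se_fit, fit + t_crit * se_fit)

# Prediction interval: uncertainty about the mean PLUS the scatter of
# individual horses about the line.
se_pred = math.sqrt(s ** 2 + se_fit ** 2)
pi = (fit - t_crit * se_pred, fit + t_crit * se_pred)

print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f}) kg")
print(f"95% PI: ({pi[0]:.2f}, {pi[1]:.2f}) kg")

# The prediction interval is wider than the confidence interval.
print((pi[1] - pi[0]) > (ci[1] - ci[0]))
```

Because s (0·713) is much larger than SE Fit (0·125), almost all of the width of the prediction interval comes from the variation between individual horses rather than from uncertainty about the line itself.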