Statistics for clinicians: An introduction to linear regression

Authors

Katherine J Lee,

Corresponding author

Clinical Epidemiology and Biostatistics Unit, Murdoch Childrens Research Institute, Melbourne, Victoria, Australia

Department of Paediatrics, The University of Melbourne, Melbourne, Victoria, Australia

Correspondence: Dr Katherine J Lee, Clinical Epidemiology and Biostatistics Unit, Murdoch Childrens Research Institute, Royal Children's Hospital, Flemington Road, Parkville, Vic. 3052, Australia. Fax: (03) 9345 6000; email: katherine.lee@mcri.edu.au

In 2000–2003, the Journal of Paediatrics and Child Health published a series of articles that gave an overview of some common statistical concepts and methods, such as standard errors, confidence intervals[1] and hypothesis tests including the t-test,[2] aimed at a clinical audience.[3] In this paper, we continue in the same vein and present an introduction to linear regression, a commonly used technique to assess the association between two or more variables.

The previous articles used the example of a comparison of verbal IQ at age 5 years between a group of children who were born with extremely low birthweight (<1000 g) and another group of children with birthweight 1000–1500 g (see fig. 3 in Carlin and Doyle[1]). Another natural question that arises from these data is how to represent the relationship between IQ at age 5 years and birthweight without categorising the latter.

This article introduces the concept of linear regression, where we fit a straight line to represent the relationship between two continuous variables, specifically addressing the question of how one variable may be predicted from the other. We describe the process of fitting a linear regression model and the interpretation of the parameters used to define the regression model. We also explain the close connection between linear regression and the t-test described in the previous series of papers[2] when the exposure variable (birthweight in the above example) is a binary variable, such as birthweight categorised as <1000 g and 1000–1500 g. We begin with the simple scenario in which we are interested in the relationship between the outcome variable and just one ‘independent’ exposure or predictor variable, although later in the paper we discuss the concept of multivariable regression, where we are interested in the relationship of the outcome to multiple predictors, for example IQ at age 5 years predicted by both birthweight and gender.

Correlation

As recommended previously,[4] it is best to begin with an examination of the relationship of interest visually. Figure 1 shows a scatter plot of birthweight (the predictor) and IQ at age 5 years (the outcome) for the 138 children used for illustration in the previous series. From this plot we can get a sense of whether it is reasonable to assume a linear or other smooth relationship between the two variables, and importantly, we can check for unusual features such as outliers of either of the two variables.

One way to describe the association between two continuously scaled variables such as IQ and birthweight is to report the correlation coefficient, which is a measure of the strength of the linear relationship between two variables. Correlation coefficients are measured on a scale from +1, which represents a perfect positive (increasing) linear relationship, to −1, which represents a perfect negative (decreasing) linear relationship. A correlation of 0 signifies that there is no linear relationship between the two variables. Correlation describes the strength of the (linear) association between two variables in a symmetrical way (i.e. changing the labels X and Y gives the same correlation), but often we are interested in more specific questions, such as whether we can predict the value of one variable (the ‘outcome’ or ‘dependent’ variable) from the other (the ‘predictor’ or ‘exposure’ or ‘independent’ variable). In our example we might ask how much IQ tends to increase for each 100-g increase in birthweight. This leads us to the concept of linear regression.
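As a concrete illustration (an addition to the article, not part of it), the correlation coefficient can be computed directly from its definition: the covariance of the two variables scaled by both standard deviations. The birthweight and IQ values below are made up purely to show the calculation and are not the study data:

```python
def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-deviation
    sxx = sum((a - mx) ** 2 for a in x)                   # spread of x
    syy = sum((b - my) ** 2 for b in y)                   # spread of y
    return sxy / (sxx * syy) ** 0.5

# Hypothetical birthweight (g) and IQ values, for illustration only.
bw = [750, 900, 1100, 1250, 1400]
iq = [92, 95, 99, 101, 104]
r = pearson_r(bw, iq)
```

Note that swapping the two arguments leaves `r` unchanged, reflecting the symmetry of correlation described above.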

Simple Linear Regression

Basic principles

The concept of (simple) linear regression refers to the fitting of a statistical model to describe how one variable (the predictor or independent variable) can be used to predict another variable (the outcome or dependent variable). A linear regression model consists of two parts: the equation for the best-fitting line through the data (as shown in Fig. 1) and an error term representing the variation around the linear trend. We express the linear regression model for the ith individual or unit in the sample of interest as follows:

y_{i} = α + βx_{i} + e_{i}    (1)

where y_{i} is the outcome value and x_{i} is the value of the predictor for the ith individual, while α and β represent the parameters (quantities to be estimated) in the model. The intercept, α, represents the expected (or average) value of the outcome when the value of the predictor is 0. The parameter β, which is the slope of the line, represents the average change in outcome per one-unit increase in the predictor; in our example it represents how much IQ increases (on average) for each 1-g increase in birthweight. (Note that the interpretation of β depends on the units used for each of the variables.) The random error term, e_{i}, represents how much individual i varies from the regression line (i.e. the difference between the observed value of y_{i} and the value expected or predicted by the model). Finally, we assume that the error terms have a mean of 0 and a constant variance σ^{2}, so that, on average, the outcome values fall on the regression line, with σ^{2} describing how much they vary around it.

In order to estimate the parameters in the regression model (α and β), we use a method known as least-squares. Under this approach, we choose the line (i.e. values of α and β) that best fits the data in the sense of minimising the sum of the (squared) deviations of each y-value from the line. This idea is illustrated in Figure 2. We sum the squared deviations rather than the deviations themselves so that all values being summed are positive. The estimates produced by this method are known as least-squares estimates, and there is some elegant mathematics that describes their properties and how they can be calculated. See Kutner et al.[5] for a more detailed description of linear regression models and least-squares estimation.
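The least-squares estimates have simple closed-form expressions: the estimated slope is Σ(x_{i} − x̄)(y_{i} − ȳ) / Σ(x_{i} − x̄)², and the estimated intercept is ȳ minus the estimated slope times x̄. As an added illustration (not part of the original article), these formulas can be computed in a few lines of Python:

```python
def least_squares(x, y):
    """Closed-form least-squares estimates for simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Slope: co-deviation of x and y divided by the squared deviation of x.
    beta = (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))
    # Intercept: forces the fitted line through the point of means (x-bar, y-bar).
    alpha = my - beta * mx
    return alpha, beta
```

For data lying exactly on the line y = 2 + 3x, the function returns (2, 3), since the deviations from the line are then all zero.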

Using the least-squares approach, the fitted regression line shown in Figure 1 turns out to be

E(IQ_{i}) = 83.370 + 0.014 × birthweight_{i}    (2)

where E(IQ_{i}) represents the expected value of IQ given birthweight (note there is no error term in this equation, as we have now written the equation for the expected value of the outcome, i.e. the points on the fitted regression line). From this regression equation, IQ increases on average by 0.014 points per gram increase in birthweight. A more useful way of expressing this might be as 1.4 points per 100-gram increase in birthweight. Using this equation we can estimate the expected IQ at age 5 years for any given birthweight. Note that the intercept in a regression model is not always meaningful; in our example, the intercept represents the estimated IQ in infants with a birthweight of 0! More generally, using the regression line for prediction beyond the sample range of the predictor is not recommended.
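Equation (2) can be read as a simple prediction rule. The Python sketch below (an illustrative addition using the fitted coefficients reported above) shows such a prediction; as noted, it should not be applied to birthweights outside the sample range:

```python
def expected_iq(birthweight_g, intercept=83.370, slope=0.014):
    """Expected IQ at age 5 from the fitted line in equation (2)."""
    return intercept + slope * birthweight_g

# Expected IQ for a birthweight of 1000 g: 83.370 + 0.014 * 1000 = 97.37.
iq_at_1000g = expected_iq(1000)

# The slope per 100 g is 100 times the per-gram slope: 1.4 IQ points.
per_100g = expected_iq(1100) - expected_iq(1000)
```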

Inference

In estimating the parameters of the regression model, we are often interested in making a statistical inference, as we will be using our data to make claims or predictions about the relationship between two variables that we hope can be generalised to the population from which our sample of individuals has been taken. The estimated parameter values from our sample are not the true values in the population, so it is important to quantify the uncertainty in our estimates. This requires the key concept of sampling variability that was introduced in Carlin and Doyle[1] and enables us to quantify how much the estimates of our parameters would vary if we repeatedly drew samples from our target population.

As described previously for simple means and proportions, the sampling variability is captured by the standard errors of the estimates, in this case the estimates of α and β. These standard errors can be used to obtain a confidence interval for each parameter, which represents an interval for the true value that is consistent with the data in the sense of having a specified probability of capturing the true value in repeated samples. For example, we might report that the IQ at age 5 years increases by 0.014 points per gram increase in birthweight, with a 95% confidence interval from 0.004 to 0.024 (Table 1).
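For readers curious how such a confidence interval is obtained, the normal-theory formula combines the residual variance (residual sum of squares divided by n − 2) with a t quantile on n − 2 degrees of freedom. The following Python sketch assumes SciPy is available and uses made-up data; it illustrates the formula, not the analysis behind Table 1:

```python
import math
from scipy.stats import t

def slope_ci(x, y, level=0.95):
    """Normal-theory confidence interval for the slope in simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    alpha = my - beta * mx
    # Residual sum of squares around the fitted line.
    rss = sum((b - (alpha + beta * a)) ** 2 for a, b in zip(x, y))
    # Standard error of the slope: residual variance spread over the x-deviations.
    se = math.sqrt(rss / (n - 2) / sxx)
    tcrit = t.ppf(1 - (1 - level) / 2, df=n - 2)
    return beta, beta - tcrit * se, beta + tcrit * se
```

When the data lie exactly on a line the residuals are zero, so the interval collapses onto the estimated slope; with noisy data it widens accordingly.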

Table 1. Results from the linear regression of birthweight as a predictor of IQ at age 5 years

                 Estimate   95% confidence interval   P value
  Intercept (α)  83.370     72.354 to 94.385          <0.001
  Slope (β)      0.014      0.004 to 0.024            0.006

Finally, we may wish to carry out a hypothesis test to provide an indication of the strength of the evidence against the null hypothesis that there is no relationship between the predictor and outcome variables (i.e. β = 0). This involves assessing how frequently estimated slopes as large as ours would arise purely by chance with the given sample size if there truly were no relationship. If this chance is very small, as evidenced by a small P value, then there is strong evidence against the null hypothesis of no relationship. In our example, the P value was 0.006, indicating that an estimated slope as large as ours would be unlikely to arise by chance if there were truly no relationship between birthweight and IQ at 5 years. Although this constitutes evidence that birthweight is systematically related to or predictive of IQ at age 5 years, it should be noted that it does not directly tell us whether the relationship is well represented by a straight line; this needs to be checked, as discussed next.
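The test statistic underlying this P value is the estimated slope divided by its standard error, referred to a t distribution with n − 2 degrees of freedom. As an added illustration (assuming SciPy is available; this is not the original analysis code), the calculation can be sketched as:

```python
import math
from scipy.stats import t

def slope_p_value(x, y):
    """Two-sided P value for H0: beta = 0 in simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    alpha = my - beta * mx
    rss = sum((b - (alpha + beta * a)) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    tstat = beta / se
    # Survival function gives the upper-tail probability; double it for two sides.
    return 2 * t.sf(abs(tstat), df=n - 2)
```

A strong linear trend yields a tiny P value, while data with no trend in x yield a P value near 1.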

Assumptions

In fitting a linear regression model, we implicitly make the assumption that the relationship between Y and X is linear, that is, that Y increases/decreases as X increases in a linear (straight-line) fashion. In practice this may not always be the case. Before fitting such a model it is important to consider whether this is a plausible assumption, for example using a scatter plot similar to that shown in Figure 1.

When using our model to make inference about a population, we make a number of additional assumptions:

1. The observations are independent; for example, each pair of observations for X and Y comes from a different, unrelated individual.

2. The errors have a constant variance; that is, the variability around the regression line is constant along its length.

3. The values of the outcome Y follow a normal distribution for each value of the predictor X.

When using linear regression, it is important to check that these assumptions are realistic to ensure that the resulting inferences are valid. The first assumption can usually be assessed from the study design and the nature of the data to be analysed. A scatter plot, similar to that shown in Figure 1, can be used to assess the remaining assumptions informally. There are also a number of more formal diagnostics that have been proposed to assess the validity of these assumptions (see Section 11 in Altman[6] for more details). It is worth noting that some of the assumptions are much more important than others; for example, assumption 3 is usually unimportant as long as the sample size is not too small.[7]
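A residual plot is the usual informal tool for checking the constant-variance assumption. As a rough numerical stand-in for such a plot (an illustrative addition, not a standard named diagnostic), one can compare the average size of the residuals in the lower and upper halves of the predictor range; markedly different spreads would suggest non-constant variance:

```python
def residual_spread_check(x, y):
    """Informal constant-variance check: average absolute residual in the
    lower versus upper half of the predictor range."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    alpha = my - beta * mx
    # Pair each predictor value with its residual, then order by predictor.
    resid = sorted((a, b - (alpha + beta * a)) for a, b in zip(x, y))
    half = n // 2
    lower = [abs(r) for _, r in resid[:half]]
    upper = [abs(r) for _, r in resid[-half:]]
    return sum(lower) / half, sum(upper) / half
```

If the two returned values differ greatly, the variability around the line is probably not constant and a formal diagnostic (or a transformation of the outcome) is worth considering.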

Categorical Predictors

The association between a categorical or grouped predictor and a continuous outcome is naturally represented by the differences in mean outcome between the categorical groups defined by the predictor. It is not hard to see that this is equivalent to using a linear regression model in which the X variable (the predictor) represents the categories of the grouped variable. For example, if our predictor has just two categories (a binary predictor), fitting a linear regression model for outcome Y on (binary) predictor X is equivalent to a comparison of the mean outcome in the two groups defined by X. For illustration we return to the example from Carlin and Doyle[2] where birthweight is categorised as <1000 g and 1000–1500 g and we are interested in whether birthweight group is predictive of IQ at age 5 years. In this case, our predictor X is an indicator denoting which of the two groups each person belongs to:

X_{i} = 1 if individual i has birthweight <1000 g, and X_{i} = 0 if birthweight is 1000–1500 g    (3)

Thinking in terms of linear regression, we choose the line of best fit between our predictor X and outcome Y (Fig. 3) using Equation 1. With a binary predictor, the fitted regression line (minimising the sum of squared deviations) simply joins the mean in one group to the mean in the second group. In this context, the interpretation of α becomes the mean in the group where X = 0, in our case the mean IQ score in individuals with birthweight 1000–1500 g. The parameter β then represents the difference in the mean outcome of Y between the two groups defined by X (coded to be one unit of X apart) – in our example, the difference in the mean outcome between those with birthweight <1000 g compared with those with a birthweight 1000–1500 g. Thus the linear regression representation of the relationship between birthweight group and IQ at age 5 years is the equation:

E(IQ_{i}) = 101.7 − 8.3 × I(birthweight <1000 g)_{i}    (4)

From this equation we see that, on average, IQ at age 5 years is 8.3 points lower in children with birthweight <1000 g compared with children with birthweight 1000–1500 g. We note that the same interpretation of β exists when X is a continuous predictor, in that β represents the difference in the mean outcome at two values of X that are one unit apart (assuming that the relationship between X and Y is linear).

Using linear regression to model the association between a binary predictor and a continuous outcome as described above compares the mean outcome in two groups. This is precisely equivalent to carrying out an unpaired (equal-variance) t-test to compare the mean in two groups as described in Carlin and Doyle.[2]
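This equivalence is easy to verify numerically: with a 0/1 predictor, the least-squares intercept equals the mean of the reference group and the slope equals the difference in group means. The following sketch (an added illustration using made-up IQ values, not the study data) demonstrates this:

```python
def fit_binary(x, y):
    """Least-squares fit of y on a 0/1 indicator x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    alpha = my - beta * mx
    return alpha, beta

# Hypothetical IQ scores in two birthweight groups, for illustration only.
group0 = [104, 99, 102, 101]  # birthweight 1000-1500 g (indicator = 0)
group1 = [95, 92, 96, 93]     # birthweight <1000 g (indicator = 1)
x = [0] * len(group0) + [1] * len(group1)
y = group0 + group1
alpha, beta = fit_binary(x, y)
# alpha equals the mean of group0 (101.5);
# beta equals mean(group1) - mean(group0) (94.0 - 101.5 = -7.5),
# exactly the difference in means that the unpaired t-test compares.
```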

Similarly, when a predictor X has more than two categories, we compare the outcome in each category of the predictor to the outcome in a reference category. This is equivalent to a linear regression model with multiple indicators representing the categories of the grouped variable (see Chapter 8 in Kutner et al.[5] for more details).
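With three or more categories, the model contains an intercept plus one indicator per non-reference category, and the fitted coefficients are the reference-group mean and the differences of each other group from it. An illustrative NumPy sketch with made-up numbers (not part of the original article):

```python
import numpy as np

# Hypothetical outcome values in three groups, with group 0 as the reference.
groups = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y = np.array([10.0, 11.0, 12.0, 14.0, 15.0, 16.0, 20.0, 21.0, 22.0])

# Design matrix: a column of ones (intercept) plus one 0/1 indicator
# for each non-reference category.
X = np.column_stack([
    np.ones(len(groups)),
    [1.0 if g == 1 else 0.0 for g in groups],
    [1.0 if g == 2 else 0.0 for g in groups],
])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef[0] is the mean of group 0; coef[1] and coef[2] are the differences
# of groups 1 and 2 from group 0.
```

Here the group means are 11, 15 and 21, so the fitted coefficients are 11, 4 and 10.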

Multiple Regression

When assessing the association between an exposure (predictor) and an outcome, it is often important to adjust for potential confounding factors that can bias the exposure–outcome relationship. For example, we might want to establish the relationship between IQ at 5 years and birthweight while taking into account that socioeconomic status (SES) influences birthweight and also predicts IQ, so confounding the relationship of interest. Multiple regression models are powerful tools for tackling such problems. Essentially, the regression model allows us to assess the relationship between birthweight and IQ while holding SES constant. We achieve this by including the confounding variable as another predictor in the regression model. For example, if we want to examine the relationship between exposure X and outcome Y, adjusted for confounder Z, we can use the model

y_{i} = α + βx_{i} + γz_{i} + e_{i}    (5)

In this model, α represents the mean outcome when X = 0 and Z = 0, and β represents the relationship between predictor X and outcome Y, adjusted for Z. That is, β represents the average change in Y per one-unit increase in X when Z is held constant. (Note that this fundamentally changes the meaning of β compared with the ‘unadjusted’ model in Equation 1.) Similarly, γ represents the average change in Y per one-unit increase in Z when X is held constant.

The model can be further extended to include any finite number of predictors:

y_{i} = α + β_{1}x_{1i} + β_{2}x_{2i} + … + β_{j}x_{ji} + e_{i}    (6)

where j is the number of predictors in the model and X_{1i} … X_{ji} are the values of the j predictors for the ith individual. In this model, α represents the mean outcome when all of the predictors (X_{1} … X_{j}) are 0, and β_{j} represents the average change in outcome Y per unit increase in predictor X_{j} when all of the other predictors are held constant. As with simple linear regression, we can use this equation to obtain a prediction for Y based on the values of the covariates X_{1i} … X_{ji}. In fact, multiple regression models may be developed for a number of purposes, including prediction, in addition to their use in adjusting for confounding, and the best approach to developing a model specification (choice and coding of X variables) will generally depend on the intended purpose of the model. For further information on multiple regression see Section 11 in Kirkwood and Sterne.[8]
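The adjustment argument can be demonstrated by simulation (an added illustration; all numbers are invented). When the outcome depends on both an exposure and a confounder that also drives the exposure, the multiple regression slope for the exposure recovers the adjusted effect, whereas the unadjusted slope is pulled away from it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)                 # confounder (e.g. an SES-like score)
x = 0.8 * z + rng.normal(size=n)       # exposure influenced by the confounder
y = 2.0 + 1.5 * x + 3.0 * z + rng.normal(size=n)  # outcome depends on both

# Adjusted model: intercept, exposure and confounder, as in equation (5).
X_adj = np.column_stack([np.ones(n), x, z])
coef, *_ = np.linalg.lstsq(X_adj, y, rcond=None)
# coef[1] should be close to the true adjusted effect of 1.5.

# Unadjusted model: omitting z biases the slope for x upwards,
# because x and z are positively related and z raises y.
X_unadj = np.column_stack([np.ones(n), x])
coef_u, *_ = np.linalg.lstsq(X_unadj, y, rcond=None)
```

In this simulation the unadjusted slope is markedly larger than 1.5, which is exactly the kind of confounding bias that including Z in the model removes.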

Acknowledgements

This work was supported by a Centre of Research Excellence grant awarded to JBC and colleagues from the Australian National Health and Medical Research Council. The authors also acknowledge support provided by the Murdoch Childrens Research Institute through the Victorian Government's Operational Infrastructure Support Program.