Statistical guidelines for clinical studies of human vision


Richard A Armstrong


Citation information: Armstrong RA, Davies LN, Dunne MCM & Gilmartin B. Statistical guidelines for clinical studies of human vision. Ophthalmic Physiol Opt 2011, 31, 123–136. doi: 10.1111/j.1475-1313.2010.00815.x


Statistical analysis of data can be complex and different statisticians may disagree as to the correct approach, leading to conflict between authors, editors, and reviewers. The objective of this article is to provide some statistical advice for contributors to optometric and ophthalmic journals, to provide advice specifically relevant to clinical studies of human vision, and to recommend statistical analyses that could be used in a variety of circumstances. In submitting an article in which quantitative data are reported, authors should describe clearly the statistical procedures that they have used and justify each stage of the analysis. This is especially important if more complex or ‘non-standard’ analyses have been carried out. The article begins with some general comments relating to data analysis concerning sample size and ‘power’, hypothesis testing, parametric and non-parametric variables, ‘bootstrap methods’, one- and two-tail testing, and the Bonferroni correction. More specific advice is then given with reference to particular statistical procedures that can be used on a variety of types of data. Where relevant, examples of correct statistical practice are given with reference to recently published articles in the optometric and ophthalmic literature.


Computer software employing a wide range of data analysis methods is more widely available to clinical scientists studying human vision than at any previous time. The availability of this software, however, makes it essential that investigators apply data analysis methods appropriately. Statistical analysis of data can be complex, with many possible methods of approach, each of which applies in a particular experimental circumstance. Hence, it is not uncommon for inappropriate statistical procedures to be used, for the methods not to be clearly described, or the results of statistical tests to be misinterpreted. In the optometric and ophthalmic literature, common statistical problems include, for example, application of a relatively limited statistical analysis, such as the ‘t’ test when the data would have been appropriate for a factorial analysis of variance (anova) or confusion as to which model of anova is appropriate.1 A particularly common problem is authors not describing their statistical methods clearly enough to enable a reviewer to judge whether a particular method is valid. This situation is not helped by a confusion of terminology surrounding some statistical tests especially different forms of anova. For example, there are at least six different ways of describing a simple two-way anova and this confusion will be discussed in the relevant section.

The purpose of this article is to provide basic statistical advice for authors carrying out clinical studies of human vision, and without being too prescriptive, to recommend relevant statistical analyses in a variety of experimental circumstances. In submitting an article to an optometric or ophthalmic journal, in which quantitative data are reported, authors should be expected to describe clearly the statistical procedures that they have used and to justify each stage of the analysis. This is especially important if more complex or ‘non-standard’ analyses have been carried out. The article begins with some general comments relating to data analysis and then more specific advice is given with reference to particular statistical procedures. The recommended statistical analyses applicable to many commonly encountered circumstances are summarised in Table 1. This article only covers ‘general statistical advice’ and no specific guidelines are given relating to more specialised procedures such as epidemiology (in which authors are referred to the book by Katz),2 validation of questionnaires,3,4 and ‘meta-analysis’, and many of these topics will be the subject of future articles.

Table 1. Recommended statistical procedures for a variety of different types of data and variables. The first column lists the type of data to be analysed and the second and third columns list the parametric and alternative non-parametric statistical procedures, respectively, relevant to those data.

Form of the data | Possible statistical procedures (parametric) | Possible statistical procedures (non-parametric)
A single observation ‘x’ | Is ‘x’ a member of a specific population? |
A sample of ‘x’ values | Construct frequency distribution; calculate x*, S.D., S.E.M., CI; is X normally distributed? | Mode, median, 95th percentile
Two independent samples (x1, x2) | Unpaired t test | ‘Mann–Whitney’ U test
Two paired samples (x1 - x2) | Paired t test | ‘Wilcoxon’ signed ranks test
Two sets of measurements using two methods | Test of agreement: Bland & Altman |
Three or more independent discrete groups (x1, x2… xn) | 1-way anova, randomised design, fixed effects model | ‘Kruskal–Wallis’ test
Three or more independent groups, random variable (x1, x2… xn) | 1-way anova, randomised design, random effects model |
Three or more dependent groups in ‘blocks’ | 2-way anova, randomised blocks | ‘Friedman’s’ test
Two or more factors, completely randomised | Factorial anova, randomised design |
Two or more factors (‘major’ and ‘minor’ factor) | Factorial anova, split-plot design |
Two or more factors, one of which is ‘time’ | Factorial anova, repeated measures design |
Two variances (s1, s2) | Compare by 2-tail F test |
Three or more variances (s1, s2… sn) | ‘Bartlett’s’ test; ‘Levene’s’ test; ‘Brown-Forsythe’ test |
Two or more frequencies Fo (single variable) | χ2 goodness of fit |
2 × 2 contingency table (two variables) | Fe > 5: χ2 | Fe < 5: Fisher’s exact test, McNemar’s test
R × C contingency table (two variables) | Fe > 5: χ2 | Fe < 5: KS test
Two variables (X, Y) (linear) | Pearson’s r, r2; linear regression (r2, anova, t) | Spearman’s rs, Kendall’s τ, Gamma
Two variables (two or more samples) | Compare regression lines, analysis of covariance |
Two variables (X, Y) (non-linear) | Transform to linear; fit polynomial in X; non-linear estimation |
Y and two or more X variables (Y, X1, X2, … Xn) | Multiple regression, R2, stepwise procedure |
Several X variables (no Y) | PCA/FA |

Abbreviations: anova, analysis of variance; CI, confidence interval; F, variance ratio; χ2, chi-square; Fo, observed frequency; Fe, expected frequency; FA, factor analysis; KS, Kolmogorov-Smirnov test; PCA, principal components analysis; ‘r’, Pearson’s correlation coefficient; ‘r2’, coefficient of determination; rs, Spearman’s rank correlation; ‘R’, multiple correlation coefficient; R × C, rows times columns contingency table; S.D., standard deviation; S.E.M., standard error of the mean; τ, Kendall’s tau; ‘x*’, sample mean.

General advice

Sample size and power

The most critical problem in testing hypotheses is the possibility of making a Type 1 error, i.e., rejecting the null hypothesis (Ho) when it is true. By contrast, a Type 2 error is accepting the Ho when a real difference is present. Hence, there are two important questions that should be asked about any study. First, before the study is carried out, what sample size (N) would it be appropriate to use to estimate a quantity or detect a certain ‘effect’, so as to reduce the likelihood of a Type 2 error? Second, what is the strength or ‘power’ of an experiment that has been conducted, i.e., what difference between two or more groups was the study actually capable of detecting? The second question is of particular importance because an investigation in which a non-significant difference between two groups is reported appears to support the Ho. This may not mean, however, that the Ho should actually be accepted because the experiment may have been too small to detect the ‘true’ difference, and this is an example of a Type 2 error.

In any hypothesis test, the statistical method, e.g., a ‘t’ or ‘F’ test, indicates the probability of obtaining a result at least as extreme as that observed if the Ho were actually true. By convention, if that probability is less than 5% (p < 0.05), the Ho is usually rejected. Incidentally, p values should be quoted using one of two methods: either the actual p value is given (preferred) or the range of p is indicated as follows (*p < 0.05, **p < 0.01, ***p < 0.001).

The ability of an experiment to reject the Ho depends on a number of factors including the probability chosen to reject the Ho (p = 0.05), the variability of the measurements, the sample size since larger values of ‘N’ lead to more accurate estimates of statistical parameters, and the effect size, i.e., the size of the actual effect in the population, larger effects being easier to detect. Statistical software (such as GPower)5 is now widely available to calculate power and to estimate ‘N’ in a variety of circumstances including comparison of two or more groups, factorial anova, and in correlation and regression studies and should be consulted where relevant. Authors should consider the problems of sample size and the power of their study and have an appreciation of the limitations of this information.
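The dependence of ‘N’ on the significance level, the variability, and the effect size can be illustrated with a minimal sketch. It assumes a two-tail comparison of two independent means and uses a normal approximation rather than the exact ‘t’-based calculation that dedicated power software would perform; the function name and the example values (a 2 mmHg difference, S.D. of 3 mmHg) are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate N per group needed to detect a difference `delta`
    between two means (two-tail test, common S.D. `sd`).
    Normal approximation only; exact methods give slightly larger N."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-tail alpha
    z_beta = z.inv_cdf(power)           # value matching the desired power
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Hypothetical example: halving the effect size roughly quadruples N.
print(n_per_group(delta=2.0, sd=3.0))
print(n_per_group(delta=1.0, sd=3.0))
```

Smaller effects, larger variability, or a higher required power all increase the sample size returned by this approximation.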

Test of a hypothesis or estimating a quantity

Most studies pose one of two statistical questions. The first is a test of a hypothesis, e.g., does practice in fluent reading train eye movements that result in a good ‘Developmental Eye Movement’ (DEM) score?6 The answer to this question will be either ‘yes’ or ‘no’ and an experiment is designed to elicit this answer. By convention, we prefer to believe the Ho that there is no effect of the training on DEM score until the experiment provides evidence otherwise. The second type of question involves the estimation of a quantity. It may be established that practice in fluent reading trains eye movements that result in a good DEM score and an experiment may be designed to quantify the magnitude of this effect in a particular age group. Hence, statistical analysis of data enables Ho to be tested and the errors involved in estimating quantities to be determined. The former is achieved by a statistical test of the Ho, the latter by fitting an error such as a confidence interval (CI) to a sample mean, i.e., providing a range of values in which an investigator may be 95% confident that the true value lies. Where possible, investigators should formulate their study as hypotheses to be tested or effects to be estimated.

Parametric or non-parametric data

Variables can be divided into ‘parametric’ and ‘non-parametric’. A parametric variable is normally distributed whereas non-parametric variables have a distribution whose shape may be markedly different from normal. Authors should consider whether their data are parametric or non-parametric and analyse the data appropriately (Table 1). Either evidence should be provided from previous studies that a measured variable is normally distributed or a test of normality should be carried out on a sample of data. The two most common ways in which a distribution may deviate from normality are the degree of ‘skew’ and ‘kurtosis’ and statistical software will often provide a test of significance of these specific properties.

Two methods are available for analysing non-normal data. The first is to convert or ‘transform’ the original measurements so that they are expressed on a new scale that is more likely to be normally distributed than the original. Parametric statistics can then be carried out on the transformed values. Many variables are positively skewed and a logarithmic transform (Log X) would be appropriate. In addition, a square root (√X), reciprocal (1/X), or angular (arcsin) transformation may be required in specific circumstances.7 After analysis, results should be converted back to the original scale. Authors should justify any transformation of the data they undertake prior to analysis. The second is to carry out a non-parametric or ‘distribution-free’ test. There are several non-parametric tests available corresponding to the more familiar ‘t’ test, correlation coefficient (‘r’), and simpler forms of anova and these are listed in Table 1.
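The transform, analyse, and back-transform sequence can be sketched as follows. This is an illustration only, assuming strictly positive, positively skewed data; the function name and the data are hypothetical.

```python
from math import exp, log
from statistics import mean


def back_transformed_mean(values):
    """Mean computed on the log scale and converted back to the
    original scale (this equals the geometric mean)."""
    logs = [log(v) for v in values]  # transform: Log X
    return exp(mean(logs))           # back-transform after analysis


# Hypothetical positively skewed sample
data = [1.0, 2.0, 4.0, 8.0, 64.0]
print("arithmetic mean:", mean(data))
print("back-transformed mean:", back_transformed_mean(data))
```

The back-transformed mean is pulled far less by the single large value than the arithmetic mean, which is why results reported on the original scale after a log transform should be labelled as such.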

‘Bootstrap’ methods

An increasingly popular approach to statistical inference is the ‘bootstrap’,8 a general method of estimating the properties of a quantity by sampling from a distribution which approximates to that of the quantity studied. The method can be used in a variety of circumstances including the construction of tests of significance, as an alternative to parametric statistics when these are in doubt, and when parameter estimation is difficult or requires complicated formulae to calculate standard errors (S.E.). It is a simple and straightforward method to apply when estimating such quantities as percentiles of a distribution, proportions, and odds ratios. A problem with such methods of estimation, however, is that they may provide an estimate of a quantity that is too ‘liberal’ and the assumptions underlying the analysis may not be clear. As an example, the method was used to model corneal surfaces using polynomial functions.9 The specific question involved how many terms should be included in the polynomial equation to describe the curves most effectively and bootstrap methods were used to improve the accuracy of previous methods. In using such methods, authors should clearly explain the rationale of their use and consider the assumptions of the analysis.
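As an illustration of the general idea, the following sketch resamples the observed data with replacement to obtain a percentile confidence interval for a median. It is one simple variant of the bootstrap, not the procedure used in the cited corneal study; the function name, the number of resamples, and the data are assumptions.

```python
import random
from statistics import median


def bootstrap_ci(sample, stat=median, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap CI: resample with replacement, recompute the
    statistic each time, and read off the central `level` of the results."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    estimates = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical skewed sample
data = [12, 13, 13, 14, 15, 15, 16, 18, 21, 30]
print(bootstrap_ci(data))
```

With small samples the percentile interval can be too narrow, which is one form of the ‘liberal’ behaviour mentioned above.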

1-tail or 2-tail tests

For many statistical procedures, it is possible to test two different Ho. First, it can be hypothesized that for example, objective measures of open- and closed-loop ocular accommodation are related to systemic cardiovascular function.10 This hypothesis does not specify whether alteration in accommodation would change cardiovascular function in a specific direction. In this case, a ‘two-tailed test’ would be appropriate, i.e., both tails of the ‘t’ distribution are used to test the Ho. Second, it could be hypothesized that a treatment could only increase a measured property and it may be inconceivable that it could decrease the measurement. If the Ho specifies whether a positive or a negative effect is necessary to refute the Ho, a ‘one-tail’ test would be appropriate. Unless it can be justified, however, two-tail probabilities should be used.

Bonferroni correction

The Bonferroni correction is a popular method used to address the problem of making multiple statistical comparisons. Hence, if ‘n’ dependent or independent Ho are tested on a set of data, then each individual Ho should be tested using a statistical significance level of 1/n times the ‘p’ value that would be used if only one Ho were tested. The Bonferroni correction has become popular in recent years but it is a ‘conservative’ procedure. Hence, important findings may be found to be not significant because of the increased risk of a type 2 error. In addition, the method is concerned with the general hypothesis that all Ho are true simultaneously which is unlikely to be of general interest.11 A further question often posed is why should the results of one specific test depend on how many other tests are carried out? Despite these reservations, reviewers often insist on the use of the Bonferroni correction and a conservative procedure may be justified by the experimental hypothesis. Hence, authors should exercise their judgment when using this correction.
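The arithmetic of the correction is shown in the short sketch below, in which each of ‘n’ tests is judged against 1/n times the usual significance level. The function name and the p values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (p, significant?) for each test, judging every p value
    against the Bonferroni-adjusted level alpha / n."""
    threshold = alpha / len(p_values)
    return [(p, p < threshold) for p in p_values]


# Three hypothetical tests: the adjusted level is 0.05 / 3
for p, significant in bonferroni([0.001, 0.02, 0.04]):
    print(p, "significant" if significant else "not significant")
```

Two of these three p values would have been significant at the uncorrected 5% level, which illustrates how conservative the procedure can be.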

Statistical tests on a single sample of measurements

Mean, Standard deviation (S.D.) and Standard error (S.E.)

If several estimates of a quantity are made, it is common practice to report their mean and standard deviation (S.D.) (a measure of the variability of the original measurements) or S.E. (a measure of the error associated with estimating the sample mean). This information should be presented as: mean (S.D. or S.E. = …).

Coefficient of variation

Variability of a sample of measurements can also be expressed as the ‘coefficient of variation’ (CV), i.e., the S.D. expressed as a percentage of the mean. The CV provides a standardised method of expressing the variability of a measurement in an experiment. Each variable often has a characteristic CV, which is relatively constant across similar experiments, and therefore, an estimate of the variability can be made in advance by examining the results of previous experiments and may be useful information in estimating sample size.
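The descriptive statistics discussed in the two preceding sections can be computed as in this minimal sketch, which assumes the sample S.D. with an n - 1 divisor; the function name and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def describe(values):
    """Mean, S.D., S.E. of the mean, and coefficient of variation (%)."""
    n = len(values)
    m = mean(values)
    sd = stdev(values)   # variability of the original measurements
    se = sd / sqrt(n)    # error associated with estimating the mean
    cv = 100 * sd / m    # S.D. as a percentage of the mean
    return {"n": n, "mean": m, "sd": sd, "se": se, "cv_percent": cv}


# Hypothetical sample of five measurements
print(describe([14, 16, 15, 17, 13]))
```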

Confidence intervals for a sample mean

The S.E. of the mean can be used to calculate a ‘confidence interval’ (CI) or error bar, which is plotted on a line graph, and which indicates the degree of confidence in the sample mean as an estimate of the population mean. Investigators may plot the S.D. of a sample, the S.E. of a sample mean, or the 95% CI on a line graph or histogram and each will convey different information. We recommend that only the CI should be plotted on a line graph as S.E. are misleading in this context and are often misinterpreted.7 Hence, in a study by Armstrong12 of the densities of neuropathological changes in striate and extrastriate visual cortex in variant Creutzfeldt-Jakob disease (vCJD), confidence intervals rather than standard errors should have been fitted to the means. In addition, the error bars should not be used to make judgments as to whether there are significant differences between two or more sample means. To test whether two means are statistically different requires another form of S.E., i.e., the ‘standard error of the difference between two means’. If three or more means are involved, then an error variance is derived from the data as a whole using anova.
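A 95% CI for a sample mean can be sketched as the mean plus or minus a critical value of ‘t’ multiplied by the S.E. In this illustration the critical value is supplied by the user from ‘t’ tables for n - 1 degrees of freedom (2.776 for 4 degrees of freedom); the function name and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def ci_mean(values, t_crit):
    """Confidence interval for a sample mean.
    `t_crit` is the two-tail critical value of t for n - 1 degrees of
    freedom, taken from tables; it is not computed here."""
    m = mean(values)
    se = stdev(values) / sqrt(len(values))
    return m - t_crit * se, m + t_crit * se


# Hypothetical sample of five measurements; t(0.05, 4 df) = 2.776
print(ci_mean([14, 16, 15, 17, 13], t_crit=2.776))
```

With only five observations the interval is nearly three S.E. either side of the mean, which is why plotting S.E. bars alone can give a misleading impression of precision.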

Median and percentiles

If the data are not normally distributed, the S.D. is no longer an accurate descriptor of the spread of a distribution with a given mean. Authors should then quote the median and the ‘percentiles’ of the distribution, e.g., the 90th percentile of a distribution is the score such that 90% of the observations fall short of and 10% exceed the score.13
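The median and percentiles can be obtained as in the following sketch. Several conventions exist for interpolating percentiles; the ‘inclusive’ method of Python’s statistics module is used here as an assumption, and the function name and data are hypothetical.

```python
from statistics import median, quantiles


def percentile(values, pct):
    """The pct-th percentile (1 to 99) using linear interpolation
    between the ordered observations ('inclusive' method)."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return cuts[pct - 1]


# Hypothetical positively skewed sample
data = [12, 13, 13, 14, 15, 15, 16, 18, 21, 30]
print("median:", median(data))
print("90th percentile:", percentile(data, 90))
```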

Testing the difference between two groups

An experiment designed to compare two groups of patients or two treatments can be carried out by two different methods, viz., the ‘unpaired’ and the ‘paired’ methods. If experimental subjects were allocated at random, and without restriction, to two treatment groups then it is an ‘unpaired’ design. In this circumstance, an unpaired ‘t’ test or its non-parametric equivalent the Mann–Whitney test,14 would be appropriate (Table 1). By contrast, if experimental subjects are first, divided into pairs and second, the experimental treatments are then allocated independently and at random to the members of each pair, the experiment is in a paired design. In this circumstance, a paired ‘t’ test or its non-parametric equivalent the Wilcoxon signed-rank test15 should be used (Table 1). The ‘paired’ method, for example, would be appropriate for comparing intraocular pressure (IOP) measured pre- and post-pupil dilation while the ‘unpaired’ method would be more appropriate for a comparison between IOP measured in two separate populations.
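The difference between the two analyses lies in the S.E. used, as the following sketch shows. It computes only the ‘t’ statistic and its degrees of freedom, leaving the p value to be obtained from tables or statistical software; the unpaired version assumes equal variances, and the function names and IOP values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev, variance


def unpaired_t(a, b):
    """t for two independent samples, using a pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    se_diff = sqrt(pooled * (1 / na + 1 / nb))  # S.E. of the difference
    return (mean(a) - mean(b)) / se_diff, na + nb - 2


def paired_t(a, b):
    """t for paired samples, based on the within-pair differences."""
    d = [x - y for x, y in zip(a, b)]
    se = stdev(d) / sqrt(len(d))
    return mean(d) / se, len(d) - 1


# Hypothetical IOP (mmHg) in the same five subjects pre- and post-dilation
pre = [15, 17, 14, 18, 16]
post = [16, 19, 15, 20, 17]
print("paired:", paired_t(post, pre))
print("unpaired (inappropriate here):", unpaired_t(post, pre))
```

Because pairing removes the between-subject variation, the paired statistic for these data is much larger than the unpaired one, so choosing the wrong test can change the conclusion.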

Testing the difference between two or more variances

It may be necessary to test whether the variability of two or more sets of data differ. For example, an investigator may wish to determine whether a new drug reduces the variability in the response of a patient sample compared with an older more conventional treatment. In addition, an important assumption for the use of many parametric statistical tests such as the ‘t’ test or anova is that the variability between replicates is similar in each group, i.e., that they exhibit ‘homogeneity of variance’.

Many of the recommended tests for comparing variances (Table 1) have limitations. Hence, ‘Bartlett’s test of homogeneity of variance’13,16 is regarded as ‘sensitive’ resulting in too many significant results especially with data originating from highly skewed distributions.13 Hence, the test may raise unjustified concerns about whether the data conform to the assumption of homogeneity of variance. Nevertheless, alternative tests, such as Levene’s test17 and the Brown–Forsythe test,18 also have problems. Levene’s test makes use of the absolute deviation of the individual measurements from their group means rather than the variance to measure the variability within a group. Avoiding the squaring of deviations, as in the calculation of variance, results in a measure of variability that is less sensitive to the presence of a long-tailed distribution. More recently, Levene’s test has also been called into question since the absolute deviations from the group means are likely to be highly skewed and therefore, violate another assumption required for an anova, that of normality. This problem becomes particularly acute if there are unequal numbers of observations in the various groups being compared. The Brown–Forsythe test differs from Levene’s test in that an anova is performed not on the absolute deviations from the group means but on deviations from the group medians. This test may be more accurate than Levene’s test even when the data deviate from a normal distribution. Nevertheless, both Levene’s and the Brown–Forsythe tests suffer from the same defect in that to assess differences in variance requires an anova, and an anova requires the assumption of ‘homogeneity of variance,’ which some authors consider to be a ‘fatal flaw’ of these analyses. Hence, we recommend the use of the variance-ratio (F) test to compare two variances and the careful application of Bartlett’s test if there are more than two groups.
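For two groups, the variance-ratio test recommended above can be sketched as follows. The larger variance is placed in the numerator so that, for a two-tail test, the ratio is compared with the upper 2.5% point of the F distribution for the stated degrees of freedom; the critical value itself must be taken from tables or software. The function name and data are hypothetical.

```python
from statistics import variance


def variance_ratio(a, b):
    """F = larger variance / smaller variance, with the degrees of
    freedom of the numerator and denominator in that order."""
    va, vb = variance(a), variance(b)
    if va >= vb:
        return va / vb, len(a) - 1, len(b) - 1
    return vb / va, len(b) - 1, len(a) - 1


# Hypothetical responses to an older and a newer treatment
older = [16, 19, 15, 20, 17]
newer = [15, 17, 14, 18, 16]
print(variance_ratio(older, newer))
```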

Frequency data classified into categories

There are two common situations in which the analysis of frequency data is required. First, a single variable may be classified into two or more categories and the objective may be to test whether an observed series of counts or frequencies is in accord with a series of expected frequencies (‘goodness-of-fit’ test). This is the procedure used when making a test of normality. Most statistical software will offer two methods of judging whether there are significant deviations of observed from expected frequencies, viz., the chi-square (χ2) and the Kolmogorov-Smirnov (KS) test. These tests have different sensitivities and limitations and may give conflicting results. Second, the frequencies may be expressed as a contingency table and analysed using χ2. When applied to a 2 × 2 table, however, the χ2 test is approximate and care needs to be taken in analysing data when the expected frequencies are small, either by applying Yates’ correction or by using Fisher’s 2 × 2 exact test.19 Larger contingency tables (rows × columns) can also be analysed using χ2. If the data in a 2 × 2 table are paired, then McNemar’s test should be used.
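The role of the expected frequencies in a 2 × 2 table can be illustrated with the sketch below, which computes the expected frequencies, the uncorrected χ2 statistic, and a two-tail Fisher’s exact probability. The two-tail rule used (summing all tables no more probable than the one observed) is one common convention and is an assumption here, as are the function names and the counts.

```python
from math import comb


def expected(table):
    """Expected frequencies: row total x column total / grand total."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    return [[r * c / total for c in cols] for r in rows]


def chi_square(table):
    """Uncorrected chi-square statistic for a contingency table."""
    exp_table = expected(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, exp_table)
               for o, e in zip(obs_row, exp_row))


def fisher_exact_two_tail(table):
    """Two-tail Fisher's exact probability for a 2 x 2 table."""
    (a, b), (c, d) = table
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical small table: every expected frequency is below 5
table = [[3, 1], [1, 3]]
print(expected(table))
print(chi_square(table), fisher_exact_two_tail(table))
```

Since all expected frequencies in this example are 2, the exact probability rather than χ2 would be the appropriate result to report.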

Analysis of three or more groups (anova)

If data are classified into three or more groups, the appropriate analysis is anova. anova is a method comprising many different variations, each of which applies in a particular experimental context.1,20 Unless experienced with the method, investigators should either consult a statistician prior to carrying out a more complex study involving anova or an authoritative textbook.13,21 Statisticians may disagree in their advice of how to analyse data from a complex experiment and it is not possible for us to be definitive in the advice offered and neither do we wish to be too prescriptive. Nevertheless, some common experimental designs together with their recommended anova are described with reference to a fictitious experiment involving a trial of a new glaucoma treatment.

One-way anova (fixed effects model)

If the data comprise randomly obtained measurements classified into three or more groups, they comprise a ‘one-way classification in a randomised design’. This design would result in our clinical trial, for example, if four concentrations of a new drug to treat glaucoma were tested and each given to a random sample of subjects. The anova appropriate to this design is an example of a ‘fixed effects model’ since the objective is to estimate the possible differences between the treatment groups, which are regarded as ‘fixed’ or ‘discrete’ effects. Another common use of the one-way anova in the clinical literature is in the analysis of age differences in which a random sample of subjects is first divided into groups according to their age.22,23 This design should not be confused with a ‘single-factor, repeated measures’ design,24,25 which is an example of a two-way anova since either treatments are given in a random order to a number of subjects individually or a specific treatment is measured on the same individuals at various times.

The first stage of the analysis is a variance ratio test (‘F’ test) to determine whether all the treatment means come from the same population. If treatment groups are few, say three or four, a non-significant ‘F’ suggests that it is unlikely that there are meaningful differences among the means and no further analysis would be required. A significant value of ‘F’, however, suggests real differences among the treatment means and the next stage of the analysis would involve a more detailed examination of these differences.
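The variance ratio for a one-way, fixed effects anova can be sketched as the between-groups mean square divided by the within-groups (error) mean square. The sketch returns only ‘F’ and its degrees of freedom, leaving the p value to tables or software; the function name and data are hypothetical.

```python
from statistics import mean


def one_way_anova(groups):
    """Return F, between-groups d.f., and within-groups d.f."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    # Variation of the group means about the grand mean
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of the observations about their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Hypothetical responses to three concentrations of a drug
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(one_way_anova(groups))
```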

There are various options available for the subsequent analysis of the data depending on the objectives of the experiment. Specific comparisons may have been planned before the experiment was carried out, decided during the analysis stage, or comparisons between all possible combinations of the group means may have been envisaged using ‘post-hoc’ tests. Authors should justify the use of a specific ‘post-hoc’ test. The individual tests vary in how effectively they address a particular statistical problem and their sensitivity to violations of the assumptions of anova and these have been discussed previously.20 The most critical problem is the possibility of making a Type 1 error, i.e., rejecting the Ho when it is true and post-hoc tests included in statistical software give varying degrees of protection against making a Type 1 error.

There are a limited number of non-parametric tests available for comparing three or more different groups (Table 1). A useful non-parametric anova is the Kruskal–Wallis test,26 which is the non-parametric equivalent of the one-way anova and essentially tests whether the medians of three or more independent groups are significantly different. For example, the degree of discomfort experienced by patients when using Goldmann tonometry without anaesthetic was evaluated, the Kruskal–Wallis test being used to compare different conditions.27

One-way anova (random effects model)

An alternative one-way model is the ‘random effects model’ in which the objective is not to measure a ‘fixed’ effect but to estimate the degree of variation of a particular measurement and to compare different sources of variation in space and/or time. These designs are often called ‘nested’ or ‘hierarchical’ designs.13 Hence, in our glaucoma trial, we may wish to determine the degree of variability in the response to a treatment within a single patient, between different patients, or with time. The most important statistics from a random effects model are the ‘components of variance’ which estimate the variance associated with each of these sources of variation influencing a measurement.1 Hence, the nested design is particularly useful in preliminary experiments designed to estimate different sources of variation and hence, in the design of experiments.

Two-way anova

In a two-way design, each treatment is allocated by randomization to one experimental unit within each group. The name given to each group varies with the type of experiment. Originally the terminology ‘randomised blocks’ was applied to this type of design because it was first used in agricultural experiments in which experimental treatments were given to units within ‘blocks’ of land, plots within the same block tending to respond more similarly compared with plots in different blocks.13 In many applications, the block is a single trial or ‘replication’ of the comparison between the treatments of an experiment, the trial being carried out on a number of separate occasions. In clinical studies of human vision, there is often considerable variation from one subject or patient to another and hence treatments are often given successively to each ‘subject’ in a random order; the subject therefore comprising the ‘block’ or ‘replication’. Hence, in our glaucoma trial, four different concentrations of the new treatment could be given, in a random order and with a suitable interval between treatments, to each subject. Another example of the use of this design is an experiment in which a treatment is given to one eye selected at random in a sample of subjects while the fellow eye is used as a control.

The two-way design has been variously described as a matched-sample F-test, a simple within-subjects anova, a one-way within-groups anova, a simple correlated-groups anova, and a one-factor repeated measures design. This confusion of terminology is likely to lead to problems in correctly identifying this analysis within commercially available software packages. The essential feature of the design, however, is that each treatment is allocated by randomisation to one experimental unit, usually the patient, within each group or block. Hence, Wee et al.25 compared and contrasted standard and alternative versions of refractor head (phoropter)-based charts used to determine reading ability, different treatments being applied in a random order. This is an example of the two-way design but the anova used was described as a ‘single factor, repeated measures’.

Friedman’s test,28 which compares the medians of three or more dependent groups, is the non-parametric equivalent of the two-way anova and was also applied in the study of Baptista et al.27 described earlier (Table 1).

A factorial anova, fully randomised

A factorial experiment differs from a single factor experiment in that the effects of two or more factors or variables can be studied at the same time. Combining factors in a single experiment has several advantages. First, a factorial experiment usually requires fewer replications than an experiment that studies each factor individually in a separate experiment. Second, variation between treatment combinations can be broken down into components representing specific comparisons or ‘contrasts’29,30 and which reveal the synergistic or interactive effects between the factors. The interactions between factors often provide the most interesting information from a factorial experiment and cannot be obtained from a single factor design. Third, an investigator can often add variables considered to have an uncertain or peripheral importance to the design with little extra effort. In a fully randomised design, replicates are allocated at random to all possible treatment combinations and there will be a single error term to test all main effects and interactions. Hence, in our glaucoma trial, we may wish to study the interaction of a new glaucoma treatment with patient age, a random sample of subjects being divided into two age-groups and then at random within each group, one half of the subjects being given the new treatment, the other acting as a control.

A factorial anova, split-plot design

In some experimental designs, the factors may not be equivalent to each other. A common case, called a ‘split-plot’ design, arises when one factor can be considered to be a ‘major’ factor and the other a ‘minor’ factor. This situation could arise if a different treatment was given, at random, to the right and left eyes of human subjects employing two different subject groups.1 The problem is the dependence or correlation between the measurements on eyes made on the same subject.31 In such an experiment, the subject group would be the major factor while right/left eye would be regarded as the minor factor. Hence, in our glaucoma trial, the major factor could be two groups of patients (high and low myopes) and the minor factor (control and treatment) given at random to the right and left eye. The difference between this and an ordinary factorial design is that previously, all replicates could be allocated at random to all treatment combinations whereas in a split-plot design, replicates can only be allocated at random to the main-plots; the sub-plot treatments being ‘randomised’ only within each main-plot. In some applications, experimenters may subdivide the sub-plots further to give a split-split-plot design.13 Hence, in a two-factor, split-plot anova, there are two error terms: the ‘main-plot error’ is used to test the main effect of subject group while the ‘sub-plot error’ is used to test the main effect of eyes and the possible interaction between the factors.

A factorial anova, repeated measures design

The ‘repeated measures’ factorial design is a special case of the split-plot design in which a measurement is made sequentially on experimental subjects over several intervals of time. With two groups of subjects, the anova is identical to the preceding example but with time constituting the sub-plot factor. Repeated measurements made on a single individual are likely to be highly correlated and therefore the use of the usual ‘post-hoc’ tests is questionable. Nevertheless, it is possible to partition the variance attributable to main effects and interaction into ‘contrasts’. In a repeated measures design, the shape of the ‘response curve’, i.e., the regression of the measured variable on time, may be of particular interest. A significant interaction between the main-plot factor and the repeated measure would indicate that the response curve varied at different levels of the main-plot factor. In a longitudinal study of vergence in incipient presbyopia, repeated measures anova was used to show that with decline in amplitude of accommodation, there was a statistically significant reduction in magnitude of vergence adaptation to both base-out and base-in prism.32


Testing the degree of ‘correlation’ between two variables (Y,X or X1,X2) is one of the most commonly used of all statistical methods.7,33 A test of correlation establishes whether there is a ‘linear’ relationship between two different variables and the most widely used statistic is Pearson’s correlation coefficient ‘r’.34 Nevertheless, Pearson’s ‘r’ is often misinterpreted in studies. First, the square of the correlation coefficient ‘r2’, also known as the ‘coefficient of determination’, measures the proportion of the variance associated with one of the variables that can be accounted for or ‘explained’ by the other variable. When large numbers of pairs of observations are present, e.g., N > 50 pairs, the value of ‘r’ although significant may be so low that one variable may account for a very small proportion of the variance in the other. For example, in the study of Nomura et al.,35 IOP in the Japanese population was correlated inversely with age in men (r = −0.14, p < 0.001). Although the value of ‘r’ was highly significant, only approximately 2% of the variance in IOP could be attributable to age, i.e., 98% of the variance in IOP was due to factors other than age. This is commonly observed in observational studies of large numbers of variables in which the objective may be to ‘explain’ the source of variation in one of the variables. A number of X variables may be correlated with a Y variable but each may account for only a small proportion of the total variance. Despite their statistical significance, correlations of small magnitude are not of practical value because they account for little of the total variability. Second, care is needed to ensure that only homogeneous groups are included in the correlation. Third, in many correlation studies a significant value of ‘r’ does not imply that there is a ‘causal’ relationship between the two variables.7
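The arithmetic behind the Nomura et al. example can be checked with a short Python sketch. The function names are hypothetical, and the sample size passed to `t_for_r` is an assumed value chosen only to illustrate that a small ‘r’ can give a large test statistic when N is large; it is not the sample size of the cited study.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for paired observations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_for_r(r, n):
    """t statistic (n - 2 degrees of freedom) for testing r against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r_reported = -0.14                    # value quoted in the text
variance_explained = r_reported ** 2  # coefficient of determination
print(f"r2 = {variance_explained:.4f}")  # about 2% of the variance

# Assumed N of 1000 pairs, for illustration only.
print(f"t = {t_for_r(r_reported, 1000):.2f}")
```

With an ‘r’ of −0.14 the coefficient of determination is 0.0196, which is the ‘approximately 2%’ quoted above.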

If the data are not parametric, then a non-parametric correlation coefficient may be more appropriate. The most widely used of the non-parametric correlation coefficients are Spearman’s rank correlation (rho or rs36), Kendall’s tau (τ37), and gamma13 (Table 1).
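As an illustration, Spearman’s rank correlation can be obtained by replacing each observation by its rank (tied values sharing the average rank) and then calculating Pearson’s ‘r’ on the ranks. The sketch below assumes this standard definition; the function names and data are hypothetical.

```python
import math

def average_ranks(values):
    """Rank the values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rs(x, y):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data with a monotonic but non-linear relationship.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print("Pearson r  =", round(pearson_r(x, y), 3))
print("Spearman rs =", round(spearman_rs(x, y), 3))
```

Because the ranks of these two variables agree exactly, the rank correlation is at its maximum even though the relationship is not a straight line.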


It may be necessary to compare two different methods of estimating the same quantity and, therefore, to test the extent to which the methods agree or disagree with each other. For example, in clinical studies of human patients, a certain measurement may be very difficult to make on a patient without adverse effects so that its true value is unknown. Instead a new ‘indirect’ method may be used to estimate the measurement and it is often necessary to evaluate the new method against the old. In addition, different methods might be used to estimate a quantity and their level of agreement would be important in deciding whether the two methods could be used interchangeably.

As shown by Bland and Altman,38 Pearson’s ‘r’ is a measure of the ‘strength’ of the relationship between two variables and not the degree of ‘agreement’ between them. A perfect correlation would be present if the points lay along any straight line but only if the points lay along the line of equality (the 45° line) would they indicate complete agreement. Moreover, a highly significant correlation between two variables can hide a considerable lack of agreement. A better approach is to consider by how much one method differs from the other and how far apart the measurements can be before causing problems. The essential feature of a Bland and Altman analysis is that for each pair of values the difference between them is plotted against the mean of the two values. The mean of all pairs of differences is known as the degree of ‘bias’ and is the central line plotted on a Bland and Altman graph. Either side of the bias line are the 95% limits of agreement (the bias ± 1.96 S.D. of the differences), within which it would be expected that 95% of the differences between the two methods would fall. Authors should carry out a Bland and Altman analysis of their data to test reliability and repeatability.38 Hence, Cheng et al.39 investigated retinal thickness profiles in myopic and non-myopic eyes at different locations on the retina, a Bland and Altman analysis being used to test the repeatability of peripheral retinal thickness measurements.
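The quantities plotted on a Bland and Altman graph can be sketched in a few lines of Python. The readings and the function name `bland_altman` are hypothetical, and the 95% limits are taken here as the bias ± 1.96 S.D. of the differences, which assumes that the differences are approximately normally distributed.

```python
from statistics import mean, stdev

def bland_altman(method_a, method_b):
    """Return the plotted points, the bias, and the 95% limits for two methods."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    means = [(a + b) / 2 for a, b in zip(method_a, method_b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample S.D. of the paired differences
    return {
        "means": means,            # x axis of the plot
        "diffs": diffs,            # y axis of the plot
        "bias": bias,              # central line
        "lower": bias - 1.96 * sd,
        "upper": bias + 1.96 * sd,
    }

# Hypothetical IOP readings (mmHg) on the same eyes by two instruments.
instrument_1 = [14.2, 16.8, 15.1, 19.4, 21.0, 17.3, 13.9, 18.6]
instrument_2 = [15.0, 17.1, 16.4, 20.9, 21.8, 18.9, 14.1, 20.2]

result = bland_altman(instrument_1, instrument_2)
print("bias =", round(result["bias"], 2))
print("95% limits =", round(result["lower"], 2), "to", round(result["upper"], 2))
```

In this invented example the two sets of readings are closely correlated, yet the second instrument reads consistently higher than the first, which is the kind of disagreement that ‘r’ alone would not reveal.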


Whereas in correlation studies, there may not be an obvious dependent and independent variable, in regression studies the two variables are usually designated as Y the ‘dependent’, ‘outcome’, or ‘response’ variable and X the ‘independent’, ‘predictor’, or ‘explanatory’ variable. The objective of a regression analysis is to study the shape of the relationship (whether linear or curved), to establish a mathematical equation linking Y and X, or to predict the value of Y for a new value of X.33

Testing goodness of fit of the line

A test of the goodness of fit of the data points to the regression line is essential. There are three common methods depending on the objectives of the analysis. First, ‘r2’ estimates the ‘strength’ of the relationship between Y and X. Second, anova tests, using an ‘F’ test, whether a statistically significant line is present. Third, a ‘t’ test of whether the slope of the line is significantly different from zero can be carried out. In addition, it is important to check whether the data fit the assumptions for regression analysis and, if not, whether a transformation of the Y and/or X variables is necessary.
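The three measures of goodness of fit can be illustrated for a simple linear regression with the following Python sketch. The age and IOP values are hypothetical, as is the function name `linear_fit`. For a single X variable, the ‘F’ ratio from the anova is equal to the square of the ‘t’ value for the slope, so the second and third methods lead to the same conclusion.

```python
import math

def linear_fit(x, y):
    """Least-squares line y = a + b*x with r2, the anova F ratio, and t for the slope."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    b = sxy / sxx                 # slope
    a = mean_y - b * mean_x       # intercept
    ss_regression = b * sxy       # variation accounted for by the line
    ss_residual = syy - ss_regression
    ms_residual = ss_residual / (n - 2)

    r2 = ss_regression / syy
    f_ratio = ss_regression / ms_residual      # 1 and n - 2 degrees of freedom
    t_slope = b / math.sqrt(ms_residual / sxx)  # n - 2 degrees of freedom
    return {"a": a, "b": b, "r2": r2, "F": f_ratio, "t": t_slope}

# Hypothetical data: age (years) and IOP (mmHg).
age = [25, 32, 41, 48, 55, 63, 70, 78]
iop = [14.1, 15.0, 14.6, 16.2, 15.8, 17.1, 16.5, 17.9]

fit = linear_fit(age, iop)
print({name: round(value, 4) for name, value in fit.items()})
```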

Using a regression line for prediction

A regression analysis can also be used to predict a value of Y from a new reading of X, e.g., to determine the predicted intraocular pressure (IOP) for an individual of a given age. There are two major types of prediction problem. First, there is the prediction of the ‘population’ regression line ‘μ’ for a new value of ‘x’. Hence, an investigator may wish to make inferences about the ‘height’ of the ‘population regression line’ at the point ‘x’, i.e., the average value of Y associated with a value of ‘x’. Second, there is prediction of an ‘individual’ new member of the population y1 for which x1 has been measured. In both cases, however, the predicted value of Y is actually the same but the S.E. associated with the predictions will be different. Significantly greater errors will result when estimating an individual rather than a population value.
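The difference between the two types of prediction can be made concrete with a short Python sketch based on the usual least-squares formulae: the S.E. of the height of the line at ‘x’ is s√(1/n + (x − x̄)²/Sxx), whereas the S.E. for an individual adds 1 inside the square root, where ‘s’ is the residual S.D. The data and the function name `prediction_standard_errors` are hypothetical.

```python
import math

def prediction_standard_errors(x, y, x0):
    """Predicted Y at x0 with the S.E. for the line and for a new individual."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x

    residual_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(residual_ss / (n - 2))  # residual standard deviation

    y_hat = a + b * x0                    # the same value for both predictions
    leverage = 1 / n + (x0 - mean_x) ** 2 / sxx
    se_mean = s * math.sqrt(leverage)            # height of the population line
    se_individual = s * math.sqrt(1 + leverage)  # a new individual at x0
    return y_hat, se_mean, se_individual

# Hypothetical data: age (years) and IOP (mmHg).
age = [25, 32, 41, 48, 55, 63, 70, 78]
iop = [14.1, 15.0, 14.6, 16.2, 15.8, 17.1, 16.5, 17.9]

y_hat, se_mean, se_individual = prediction_standard_errors(age, iop, 60)
print("predicted IOP at age 60:", round(y_hat, 2))
print("S.E. (population line):", round(se_mean, 3))
print("S.E. (individual):", round(se_individual, 3))
```

Both standard errors increase as ‘x’ moves away from the mean of the X values, so predictions near the edges of the observed range are less precise than those near the centre.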

Comparison of regression lines

If the relationship between two variables has been studied at various times or in different laboratories, it gives rise to two or more estimates of the relationship between Y and X. In these circumstances, it may be important to test whether the various regression lines are the same. For example, the relationship between IOP and age may have been measured in two different populations A and B. Regression lines may differ in three properties. First, they may differ in residual variance, i.e., in the degree of scatter about the lines and therefore, one line may fit the data better than the other. Second, they may differ in slope ‘b’, i.e., one line may exhibit a greater change in Y per unit of X than the other. Third, the lines may differ in elevation ‘a’, i.e., if the two lines have the same slopes they will intersect the Y axis at different points. Loss in contrast sensitivity (CS) to low spatial frequencies under scotopic conditions in older adults was studied by Clark et al.40 The investigators were particularly interested in whether temporal frequency of a stimulus altered the relationship between age and spatial contrast sensitivity function. They found that in a group of young and old observers, there was no difference in slope between age groups across the temporal frequency range.

Curvilinear regression

Linear regression may be adequate for many purposes but some variables may not be connected by such a simple relationship. The discovery of the precise relation between two or more variables is a problem of curve fitting known as ‘non-linear’ or ‘curvilinear regression’ and the fitting of a straight line to data is the simplest case of this general principle. There are three types of curve fitting procedure commonly used, viz., curves that can be transformed to straight lines such as the exponential curve, the general polynomial curve, and curves that can only be fitted by more complex methods such as non-linear estimation.13 As an example, the relationship between macular pigments and temporal vision was studied by Renzi and Hammond41 and a third-order (cubic) polynomial was fitted to the temporal contrast sensitivity function.
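The first type of curve fitting, a curve that can be transformed to a straight line, can be sketched as follows. If Y = A·exp(kX), then ln Y = ln A + kX, and an ordinary linear regression of ln Y on X estimates ln A and k. The data below are generated from assumed values of A and k purely for illustration, and the function name is hypothetical. Fitting on the logarithmic scale also assumes that the errors are multiplicative, which should be checked in a real analysis.

```python
import math

def fit_exponential(x, y):
    """Fit Y = A * exp(k * X) by linear regression of ln(Y) on X (Y must be > 0)."""
    log_y = [math.log(yi) for yi in y]
    n = len(x)
    mean_x, mean_ly = sum(x) / n, sum(log_y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (li - mean_ly) for xi, li in zip(x, log_y))
    k = sxy / sxx
    ln_a = mean_ly - k * mean_x
    return math.exp(ln_a), k

# Noise-free data generated from assumed values A = 2 and k = 0.5.
x = [0, 1, 2, 3, 4]
y = [2 * math.exp(0.5 * xi) for xi in x]

a_hat, k_hat = fit_exponential(x, y)
print("A =", round(a_hat, 6), "k =", round(k_hat, 6))
```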

Multiple linear regression

Multiple linear regression (see references 13 and 21 for authoritative treatments of this topic) determines the linear relationship between a dependent variable (Y) and multiple independent variables (X1, X2, X3… Xn) and has many uses. First, it enables a linear equation involving the X variables to be constructed that predicts Y, e.g., an investigator may wish to predict the degree of lens accommodation in a patient under a set of conditions specified by a series of X variables including demographic data such as age and morphological features of the lens and eye. Second, given several possible X variables that could potentially be related to Y, an investigator may wish to select a subset of the X variables that gives the best linear prediction equation. Third, it may be important to determine which of a group of X variables are actually related to Y and to rank them in order of importance. For example, Hazel et al.42 determined which objective measures of visual function (X variables) are the most closely related to the perceived subjective quality of life in patients with acquired macular disease (Y). In addition, Leat and Woodhouse43 in a study of reading performance with low vision aids, used multiple regression to show that the best predictor of reading rate was contrast sensitivity at 0.5 c deg−1, and no other component of the contrast sensitivity function helped to explain more of the variance. Multiple regression, however, should be thought of as an exploratory method, the results of which should be tested on a new set of data and preferably, by a more rigorous experimental approach.21

Goodness of fit of points to regression plane

A goodness of fit test of the multiple regression to the data points should be carried out using anova. The ‘F’ test determines whether any of the X variables included in the regression are related to Y. Alternatively, a ‘t’ test of the significance of each of the regression coefficients (b1, b2) can be made. It should be noted that even if the regression coefficients are statistically significant, it is not uncommon that the fraction of the variation in the Y values attributable to or ‘explained’ by the regression may be considerably <50%. A multiple regression analysis in which <50% of the overall variance in Y is explained by the X variables will probably have limited value.21

Multiple correlation coefficient ‘R’

The multiple correlation coefficient ‘R’ is the simple correlation between Y and its linear regression on all of the X variables included in the study. Hence, ‘R2’ is the fraction of the variation of the Y values attributable to the regression as a whole while ‘1−R2’ is the proportion of the variation not associated with the regression. Hence, ‘R’ should be at least 0.7 (R2 = 49%).

Interpretation of regression coefficients

In any study, there will be X variables related to Y which have not been included. These may be variables thought to be unimportant, too difficult to measure, or unknown to the investigator. It is therefore advisable, at least initially, to include all X variables in the study that are likely to affect Y or to study a population in which variables not of direct interest can be controlled. Introducing more variables into an analysis, however, adds to the data collection effort, may contribute only noise to the prediction, and may reduce the sensitivity (‘power’) of the analysis.21 Deciding which variables to include in a study is usually a compromise between trying to achieve good predictive power while excluding irrelevant variables.

Stepwise multiple regression

Authors may wish to select a small subset of the X variables which give the best prediction of the Y variable. In this case, the question is how many variables should the regression equation include? The recommended method is a ‘stepwise multiple regression’ analysis. There are two forms of this analysis called the ‘step-up’ (or forward) method and the ‘step-down’ (or backward) method.13

In the step-up method, variables are entered into the equation one at a time. At each stage, introduction of a new variable is tested to determine whether its effect is statistically significant, using an ‘F’ test, or a change in ‘R2’. Two criteria are often used. First, ‘F to enter’ sets an ‘F’ value which has to be exceeded before a variable will be added into the equation. Second, ‘F to remove’ sets a value that the computer uses to decide whether, after adding a new variable, a variable previously entered should be removed. If change in ‘R2’ is used as a test criterion, there should be a change of at least a few percentage points before a new variable is included in the equation. The analysis continues until the next variable has an ‘F to enter’ which fails to achieve significance. In the step-down method, the multiple regression of Y on all the X variables is first calculated. The contribution of each X to a reduction in variation of the Y values is then computed and the variable giving the smallest reduction eliminated by a rule. This variable is excluded and the process repeated until no variable qualifies for omission according to the rule employed.

The step-up and step-down methods do not necessarily select the same variables for inclusion in the regression and these differences are magnified when the X variables are themselves correlated. The step-up method is sometimes used to define the variables that influence Y more rigorously and exclude variables that make relatively small contributions to the regression. The step-down method may retain more variables, some of which may make small contributions to the regression, but by retaining them, a better prediction may result. There is a high probability of making a Type 1 error when carrying out stepwise multiple regression. Hence, with 20 non-significant X variables in a study, the probability of achieving at least one significant ‘F to enter’ is 1 − (1 − 0.05)^20, i.e., approximately 0.642 (64.2%), a greater than 50% chance.
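The probability quoted above can be verified with a short Python sketch. The calculation assumes that the 20 tests are independent, which will not hold exactly when the X variables are correlated, and the function name is hypothetical. The final lines show, as an illustration, the effect of dividing the significance level by the number of tests (the Bonferroni correction).

```python
def prob_at_least_one_type1_error(n_tests, alpha=0.05):
    """Chance of at least one false positive among n independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests

p_uncorrected = prob_at_least_one_type1_error(20)
print(f"20 tests at 0.05: {p_uncorrected:.4f}")

# Bonferroni correction: test each variable at alpha divided by the number of tests.
p_corrected = prob_at_least_one_type1_error(20, alpha=0.05 / 20)
print(f"20 tests at 0.05/20: {p_corrected:.4f}")
```

With the corrected significance level the overall probability of a Type 1 error falls back to just under 0.05.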

Principal components analysis (PCA) and factor analysis (FA)

Principal components analysis (PCA) and factor analysis (FA)21 are also methods of examining the relationships between variables but no distinction is made between the dependent and independent variables; all variables are essentially treated the same. Originally, PCA and FA were regarded as distinct methods but are now combined into a single analysis; PCA often being the first stage of a FA.21 The basic objective of a PCA/FA is to examine the relationships between the variables or the ‘structure’ of the variables and to simplify the data by determining whether these relationships can be explained by a smaller number of ‘factors’.

PCA/FA has many potential applications. First, the variables under study could be a sample of patients with a specific disease such as age-related macular degeneration (AMD), the data for each patient comprising quantitative measures of relevant clinical and pathological features. Individual patients with AMD often show considerable heterogeneity, i.e., significant variation in clinical symptoms and in pathology. The objective might be to describe and summarise these differences, e.g., is there a gradation of features between patients or are patients grouped together in two or more clusters that might represent distinct subtypes of the disease? This procedure was used by Armstrong et al.44 to investigate the degree of neuropathological heterogeneity within a group of 80 cases of Alzheimer’s disease. In addition, it may be desirable to determine which of the measured clinical or pathological features of the disease best ‘explain’ this pattern of variability. Second, the objective may be to determine how many individual dimensions may explain a particular phenomenon. Hence, McKee et al.45 measured optotype acuity, Vernier acuity, grating acuity, contrast sensitivity, and binocular vision in 427 adults with amblyopia or possessing the risk factors for amblyopia. Two major dimensions in variation in visual performance were identified, one related to measures of visual acuity and the other to measurements of contrast sensitivity. Third, an investigator may have an actual hypothesis as to the number of important underlying ‘factors’ that could ‘explain’ a complex phenomenon and the analysis could be used to test this hypothesis. For example, it may be hypothesised that most of the variation in clinical features among AMD patients might reflect variations in two environmental factors, viz., the lifetime exposure of the patient to sunlight and the number of cigarette packs smoked per week.46 Fourth, the variables could be questions that form part of a questionnaire.42 The purpose of the analysis might be to ‘verify’ the questionnaire, i.e., to determine whether the questions included in the survey were actually testing the aspect of vision under investigation.

Concluding remarks

  1. The skills necessary to design experiments, analyse them appropriately, and discuss the results in context are not easy to acquire. Data analysis can sometimes be seen almost as more of an ‘art’ than a ‘science’ and indeed, different statisticians will frequently disagree as to the correct approach leading to conflict between authors, editors, and reviewers!
  2. One of the most common problems encountered by statistical reviewers is that authors do not describe their statistical methods clearly enough. This makes it difficult to assess the validity of the analysis and the interpretation of the results. Hence, as a first principle, authors should describe clearly their experimental design, the data collection methods, and how the statistical analysis relates specifically to the design of the investigation.
  3. Statistical analysis should therefore always be considered at the design stage of an experiment. Appropriate statistical advice should be sought at this stage especially if a more complex analysis is envisaged such as a complex factorial anova, multiple regression or PCA/FA.
  4. The problems of sample size and power should be addressed. A sample size calculation using GPower is a useful guide to sample sizes but should be viewed with caution as calculations are often made on unreliable assumptions of subject variability and may suggest an unrealistic number of subjects or replicates. A power calculation is useful if no significant differences were detected in an experiment that was expected to reveal a ‘true’ difference.
  5. The statistical methods should be identified in the order in which they were used. Standard techniques, e.g., mean, S.D., S.E., ‘t’ tests, ‘r’, do not need to be described but where more complex methods have been applied or where non-standard methods are used, these should be clearly explained and a justification given for their use.
  6. The assumptions on which an analysis is based should be discussed; e.g., whether the data are normally distributed and whether different groups exhibit homogeneity of variance. A test of normality is useful if the variable measured is unusual or if there is little previous published information of its likely distribution. Highly skewed data or data with heterogeneous variances may require transformation, the use of a non-parametric test, or bootstrap methods.
  7. Investigators should be especially alert to the different varieties of anova. There are many different forms of this analysis each of which is appropriate in the analysis of a specific experimental design. The terminology applied to these designs may also differ in available commercial software. Common errors include analysis of an experiment as a fully randomised design when randomised blocks are present, or a split-plot design as a fully randomised factorial design. It is important to check for model assumptions and that the degrees of freedom and error terms are correct. An erroneous design will not test a Ho accurately.
  8. There should be a clear hypothesis in mind before carrying out more complex ‘multivariate’ procedures such as multiple regression or PCA/FA. Multiple regression is probably best used in an exploratory context, i.e., identifying variables that might profitably be examined in a more detailed study. Where there are many variables potentially influencing Y, they are likely to be mutually correlated and to account for relatively small amounts of the variance. Any analysis in which R2 is <50% should be regarded as probably not indicative of the presence of significant variables.


Richard A. Armstrong

Dr Armstrong was educated first at King’s College London and subsequently at St. Catherine’s College, Oxford. His early research involved the application of statistical methods to problems in ecology and botany. He taught ecology for many years at the University of Aston before retraining in Neurosciences at the Institute of Psychiatry, London and at the University of Washington, Seattle. Subsequently he has taught biomedical subjects to students of optometry at the Vision Sciences department of Aston University. His current major research interest is in the application of quantitative methods in the study of the pathology of neurodegenerative disease.

Mark Dunne

Dr Dunne has been a lecturer in the Optometry Department at Aston University since 1990. He teaches anatomy, physiology, statistics, epidemiology and clinical practice development. He also teaches research methods on the Ophthalmic Doctorate.

Leon N. Davies

Dr Davies is currently a senior lecturer in optometry and convener of the Ophthalmic Research Group at Aston University. Leon’s clinical teaching at Aston is centred around investigative techniques. He served a term as the clinical editor of the UK’s professional journal Optometry Today, and recently received the inaugural College of Optometrists Research Fellowship Award for his work on presbyopia.

Bernard Gilmartin

Following undergraduate and postgraduate study in optometry at City University, Professor Gilmartin joined what was then the Department of Ophthalmic Optics at Aston University in 1974 and spent the rest of his formal career there, concentrating largely on teaching clinical optometry and ophthalmic drugs. He was editor of Ophthalmic and Physiological Optics during the period 1987–2000 and is a Life Fellow of the College of Optometrists.