**Citation information**: McAlinden C, Khadka J & Pesudovs K. Statistical methods for conducting agreement (comparison of clinical tests) and precision (repeatability or reproducibility) studies in optometry and ophthalmology. *Ophthalmic Physiol Opt* 2011, **31**, 330–338. doi: 10.1111/j.1475-1313.2011.00851.x

**Ophthalmic and Physiological Optics**

# Statistical methods for conducting agreement (comparison of clinical tests) and precision (repeatability or reproducibility) studies in optometry and ophthalmology

Colm McAlinden

E-mail address: colm.mcalinden@gmail.com

## Abstract


The ever-expanding choice of ocular metrology and imaging equipment has driven research into the validity of their measurements. Consequently, studies of the agreement between two instruments or clinical tests have proliferated in the ophthalmic literature. It is important that researchers apply the appropriate statistical tests in agreement studies. Correlation coefficients are hazardous and should be avoided. The ‘limits of agreement’ method originally proposed by Altman and Bland in 1983 is the statistical procedure of choice. Its step-by-step use and practical considerations in relation to optometry and ophthalmology are detailed in addition to sample size considerations and statistical approaches to precision (repeatability or reproducibility) estimates.

## Introduction

Technology is developing at an explosive rate, as can be seen in the constant production of new instrumentation for ophthalmic practice. One common issue with a new instrument or clinical test is its agreement with existing instruments or tests. In relation to the eye, instruments provide an indirect estimate rather than a direct measure. Studies therefore generally aim to determine the agreement between two indirect estimates, often where one is considered the ‘gold standard’, while the true value remains unknown. This influx of new instrumentation has resulted in a multitude of agreement studies in the literature. Researchers often aim to determine the agreement between two clinical tests in order to establish whether they may be used interchangeably. It is vital that such studies are conducted using the appropriate statistical methods.

The aim of this review is to guide researchers and authors as to the appropriate statistical methods to apply in agreement studies between two sets of quantitative data.

## Correlation

The Pearson product moment correlation coefficient is often misapplied in agreement studies. It is calculated by dividing the sum of the products of the deviations of the two variables from their means by the square root of the product of their sums of squared deviations. It describes the closeness of the linear relationship between two variables and is the most commonly used correlation coefficient. Population and sample versions exist, calculated using the means and standard deviations of a population or a sample respectively. The interpretation of the sample correlation coefficient depends on how the sample data are collected: with a simple random sample, the sample correlation coefficient is an unbiased estimate of the population correlation coefficient. However, correlation coefficients are often used incorrectly to assess agreement between two tests or methods.^{1} This is potentially misleading, as there may be a strong correlation between two variables but poor agreement. This occurs because the Pearson correlation coefficient does not assess a one to one (line of equality) relationship between two measurement methods; the nature of the relationship is not assessed beyond its linearity.

Similarly, the correlation coefficient cannot detect systematic error. For example autorefraction may have a strong correlation with subjective refraction, but this will not indicate if autorefraction systematically underestimates the spherical component of the refraction by one dioptre (D).

Hence the correlation coefficient should only be used to assess whether a linear relationship exists between two variables. However, even under these circumstances it has serious limitations. If it is used to assess linearity between two measurement methods, it will be very dependent on the variability between subjects. Assume the following scenario where 20 patients had one measure of axial length by two different methods of non-contact biometry, as shown in *Table 1*. The correlation coefficient for the first 10 subjects is 0.47, whereas for the second 10 subjects it is 0.93. When both groups are combined, the coefficient rises to 0.98. Hence the correlation coefficient will vary considerably depending on the variability between subjects: a larger coefficient will be found in a sample with greater differences between subjects than in a sample of similar subjects. The correlation coefficient has also been used inappropriately to assess predictive factors between two variables, i.e. to determine if one variable causes another to change. Correlation only identifies the relationship between two variables; regression analysis is required to determine the value of one variable in terms of another.^{2} However, there is a close relationship between correlation and regression, whereby the squared correlation coefficient (*r*^{2}) provides a measure of the degree to which the variance in one variable can be attributed to its regression on the other.

| Subject | Axial length (mm), Method 1 | Axial length (mm), Method 2 |
|---|---|---|
| 1 | 21.12 | 21.78 |
| 2 | 21.45 | 21.04 |
| 3 | 21.87 | 21.32 |
| 4 | 22.05 | 22.12 |
| 5 | 21.98 | 21.82 |
| 6 | 21.82 | 21.72 |
| 7 | 21.52 | 21.26 |
| 8 | 22.00 | 22.19 |
| 9 | 21.98 | 21.67 |
| 10 | 21.76 | 21.42 |
| 11 | 23.45 | 23.56 |
| 12 | 23.78 | 23.76 |
| 13 | 23.69 | 23.12 |
| 14 | 23.87 | 23.99 |
| 15 | 24.52 | 24.41 |
| 16 | 24.32 | 24.12 |
| 17 | 24.81 | 24.72 |
| 18 | 24.36 | 24.39 |
| 19 | 24.42 | 24.48 |
| 20 | 24.93 | 24.99 |
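The range dependence described above can be checked numerically. The following sketch (plain Python, no external libraries) computes the Pearson coefficient for each subgroup of *Table 1* and for the pooled sample:

```python
from math import sqrt

def pearson(x, y):
    """Pearson product moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Axial length (mm) by two methods of non-contact biometry (Table 1)
method1 = [21.12, 21.45, 21.87, 22.05, 21.98, 21.82, 21.52, 22.00, 21.98, 21.76,
           23.45, 23.78, 23.69, 23.87, 24.52, 24.32, 24.81, 24.36, 24.42, 24.93]
method2 = [21.78, 21.04, 21.32, 22.12, 21.82, 21.72, 21.26, 22.19, 21.67, 21.42,
           23.56, 23.76, 23.12, 23.99, 24.41, 24.12, 24.72, 24.39, 24.48, 24.99]

r_first = pearson(method1[:10], method2[:10])   # subjects 1-10:  r ≈ 0.47
r_second = pearson(method1[10:], method2[10:])  # subjects 11-20: r ≈ 0.93
r_all = pearson(method1, method2)               # pooled sample:  r ≈ 0.98
```

The pooled coefficient exceeds either subgroup simply because the between-subject spread is larger, not because the methods agree any better.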

The intraclass correlation coefficient (ICC) presents a superior option to the Pearson correlation coefficient. The ICC is centred and scaled using a pooled mean and standard deviation, whereas for the Pearson correlation each variable is centred and scaled by its own mean and standard deviation.^{3} There are a number of ICC versions, and the choice of version is not intuitively obvious.^{4} These versions vary in terms of the underlying sampling theory (e.g. one-way or two-way, mixed or random). There are other similar approaches, such as Lin’s concordance correlation coefficient, which assesses the closeness of the data about the line of best fit in the scatter plot, taking into account how far the line of best fit is from the line of equality. Perfect concordance is achieved with a value of one for Lin’s coefficient.^{5} However, these still have problems, most notably their dependence on the range of measurements. A key issue with correlation coefficients is that they are unrelated to the scale of the measurement (dimensionless). This is problematic, as they cannot indicate the range of measurement over which instruments agree. The advantage of dimensionless values is that they are comparable across different constructs. For example, there is a higher correlation for anterior chamber depth measurement than for corneal power measurement when data from the Oculus Pentacam and the Orbscan II are compared,^{6} although the value of such a comparison is limited. For the specific application of measurement agreement, correlation coefficients provide little information since they are invariably high (*Table 2*). How should these be interpreted? Is 0.97 different to 0.98? Perhaps both are examples of excellent agreement, or perhaps 0.97 is unacceptable. Furthermore, there is no consensus regarding acceptable levels of correlation.^{7} Perhaps the popularity of correlation coefficients in agreement studies has more to do with a high correlation making the novice researcher feel good about their research than with adding any value.

| Study | Measurement compared | Correlation coefficient |
|---|---|---|
| Chen et al.^{38} | Central corneal thickness | 0.978 |
| Hashem and Mehravaran^{6} | Anterior chamber depth | 0.992 |
| Rohrer et al.^{39} | Corneal curvature | 0.929 |
| Pesudovs and Weisinger^{20} | Refraction | 0.991 |
| Shemesh et al.^{40} | Intraocular pressure | 0.84 |
| Savini et al.^{41} | Anterior chamber depth | 0.98 |
| Hoffer et al.^{42} | Axial length | 0.9995 |
| Carey et al.^{43} | Intraocular lens orientation axis | 0.99 |
| Babalola et al.^{34} | Intraocular pressure | 0.883 |
| Shammas and Chan^{44} | Corneal astigmatism | 0.994 |

For non-parametric data, different statistics are required. A commonly used and valid statistic for categorical data is Cohen’s kappa coefficient.^{8} The kappa coefficient has the added advantage that it takes into account agreement occurring by chance. However, under certain conditions the kappa coefficient is equivalent to the ICC. Because high values lack interpretability and fail to provide information on the range over which two measures agree, the ICC is discouraged in the assessment of agreement between different measurement methods.^{9}
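As a sketch of how Cohen’s kappa corrects for chance agreement, the following computes kappa for two observers each classifying the same patients as positive or negative. The counts are hypothetical, invented for illustration, and not taken from any study cited here:

```python
def cohens_kappa(both_pos, both_neg, a_only, b_only):
    """Cohen's kappa for two raters making binary decisions.

    both_pos: cases both raters call positive
    both_neg: cases both raters call negative
    a_only / b_only: cases only rater A / only rater B calls positive
    """
    n = both_pos + both_neg + a_only + b_only
    p_observed = (both_pos + both_neg) / n
    # Expected chance agreement from each rater's marginal proportions
    a_pos, b_pos = both_pos + a_only, both_pos + b_only
    p_expected = (a_pos * b_pos + (n - a_pos) * (n - b_pos)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical example: 20 agreed positives, 15 agreed negatives,
# 5 positives unique to rater A, 10 unique to rater B (n = 50)
kappa = cohens_kappa(20, 15, 5, 10)  # observed agreement 0.7, chance 0.5 → kappa 0.4
```

Although the two raters agree on 70% of cases, half of that agreement would be expected by chance alone, so kappa credits them with only 0.4.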

## Agreement

Altman and Bland recognised the limitations of using correlation coefficients in the clinical comparison of two measurement methods.^{10} Specifically, their landmark paper deals with the ‘limits of agreement’ between two methods, for example, comparing a new method to an old one. These issues are particularly important in biological systems where a direct measurement is not possible and the true value is unknown. Their approach has been used extensively in the assessment of agreement between two clinical methods for more than 25 years.^{11–17} By way of example we will compare the output of spherical aberration from two aberrometers in 10 patients as shown in *Table 3*. These data have been extracted from a recent publication.^{18}

| Subject | NIDEK OPD-Scan spherical aberration (μm) | AMO WaveScan spherical aberration (μm) |
|---|---|---|
| 1 | 0.406 | 0.214 |
| 2 | 0.491 | 0.178 |
| 3 | 0.077 | 0.095 |
| 4 | 0.176 | 0.151 |
| 5 | 0.529 | 0.262 |
| 6 | 0.139 | 0.261 |
| 7 | 0.036 | 0.137 |
| 8 | 0.003 | 0.089 |
| 9 | 0.098 | 0.112 |
| 10 | 0.122 | 0.117 |

If one plots a scatter graph of the two measurements (*Figure 1*) with a line of equality, perfect agreement would only exist if all points lay along this line of equality, whereas perfect correlation occurs if the points lie along any straight line. It is unlikely that two methods will agree exactly, but the ‘limits of agreement’ technique describes by how much the two methods differ and whether this difference is small enough to avoid problems with clinical interpretation; if so, they may be used interchangeably or a new method may replace an old one. The judgement as to the limit at which it is acceptable to use the two methods interchangeably is a clinical decision.

A plot of the difference between the methods against their mean may be more informative (*Figure 2*). This figure shows that differences of up to 0.35 μm exist between the two aberrometers. The agreement can be summarised by determining the bias, which is estimated by the mean difference (*d*), and the standard deviation of the differences (*s*). If there is a consistent bias, it can be accounted for by subtracting *d* from the new method. In this example, *d* (OPD-Scan minus WaveScan) is 0.04 μm and *s* is 0.16 μm. It would be expected that most differences lie within *d* − 1.96*s* and *d* + 1.96*s*, assuming the differences follow a normal distribution. Provided the differences within *d* ± 1.96*s* are not clinically important (clinical interpretation is an essential attribute of this approach), the two measurements could be used interchangeably. In the above example the ‘limits of agreement’ would be:

*d* ± 1.96*s* = 0.04 ± (1.96 × 0.16) = −0.27 to +0.35 μm

This means that the WaveScan may be 0.27 μm below or 0.35 μm above the OPD-Scan for this sample of 10 patients (given an understanding of the magnitude of spherical aberration in normal eyes, this agreement range is very large, too large for these measurements to be considered interchangeable). These limits are usually inserted on what is known as the ‘Bland–Altman plot’ or ‘difference plot’ (*Figure 2*).

As the limits of agreement are only estimates, confidence intervals should be calculated and reported. Firstly, the standard error of *d* ± 1.96*s* is calculated from the formula √(3*s*^{2}/*n*), where *n* is the sample size. In our example the standard error of *d* ± 1.96*s* is 0.09. For the 95% confidence intervals we have 9 degrees of freedom (*n* − 1), which corresponds to 2.26 on the *t* distribution table. Therefore the 95% confidence intervals are:

limit of agreement ± (2.26 × standard error)

Hence the 95% confidence interval for the lower limit of agreement is −0.27 − (2.26 × 0.09) to −0.27 + (2.26 × 0.09), which equals −0.47 to −0.07 μm. The 95% confidence interval for the upper limit of agreement can be calculated similarly. Obviously these confidence intervals are wide due to the very small sample size.
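The whole calculation can be reproduced in a few lines of plain Python using the *Table 3* data. Note that working from the unrounded *d* and *s* gives a lower limit near −0.26 rather than the −0.27 obtained from the rounded figures quoted in the text:

```python
from math import sqrt

# Spherical aberration (μm) from Table 3
opd = [0.406, 0.491, 0.077, 0.176, 0.529, 0.139, 0.036, 0.003, 0.098, 0.122]
wavescan = [0.214, 0.178, 0.095, 0.151, 0.262, 0.261, 0.137, 0.089, 0.112, 0.117]

diffs = [a - b for a, b in zip(opd, wavescan)]
n = len(diffs)
d = sum(diffs) / n                                    # mean difference (bias)
s = sqrt(sum((x - d) ** 2 for x in diffs) / (n - 1))  # SD of the differences
lower, upper = d - 1.96 * s, d + 1.96 * s             # 95% limits of agreement

se_loa = sqrt(3 * s ** 2 / n)   # standard error of each limit
t = 2.26                        # t value for n - 1 = 9 degrees of freedom
ci_lower = (lower - t * se_loa, lower + t * se_loa)   # 95% CI of the lower limit
```

With 10 subjects the confidence interval around each limit spans roughly ±0.19 μm, illustrating the sample size point made later in this review.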

The limits of agreement method is not appropriate for evaluating agreement with categorical data (e.g. agreement of the positive and negative decisions given by two diagnostic procedures). However, a non-parametric variant has been described,^{19} where data can be ranked with a range of data reported within centiles. If 95th percentiles are used, including presentation on a difference plot, this approach resembles the limits of agreement method. This method is commonly used with astigmatism data.^{20,21}

## Box 1: Worked example of Pearson, ICC and limits of agreement

In this example we have extracted data from a publication which compared autorefraction to subjective refraction in 190 subjects.^{20} Calculating the Pearson and ICC for the spherical equivalent refraction revealed the same coefficient of 0.991. The limits of agreement method found the mean difference (*d*) between methods to be −0.04D and the standard deviation (*s*) of the differences to be 0.36D. Hence, the limits of agreement (*d* ± 1.96*s*) were +0.75D to −0.67D. Certainly the correlation is high, but whether the agreement supports interchangeable use is open to debate and depends upon the clinical application. Now, to illustrate what would happen if a serious error were introduced, we double the autorefraction result and keep the subjective refraction unchanged. The Pearson correlation remains unchanged at 0.991 but the ICC drops to 0.796, a value which would still be classified as good according to the recommendations of Spitzer and Endicott (>0.75 = good).^{49} However, the limits of agreement deteriorate dramatically: the mean difference (*d*) grows to −1.16D and the standard deviation (*s*) to 5.21D, corresponding to limits of agreement (*d* ± 1.96*s*) of +9.06D to −11.36D. We can clearly see that correlation coefficients (Pearson and ICC) give little indication of the effect of this error, whereas the limits of agreement method reveals the full extent of the problem.
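The insensitivity of the Pearson coefficient to a scaling error can be demonstrated directly: doubling one set of measurements leaves *r* exactly unchanged, while the standard deviation of the differences (and hence the limits of agreement) balloons. The sketch below uses small hypothetical refraction-like values for illustration, not the 190-subject data set:

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

def sd_of_differences(x, y):
    d = [a - b for a, b in zip(x, y)]
    m = sum(d) / len(d)
    return sqrt(sum((v - m) ** 2 for v in d) / (len(d) - 1))

# Hypothetical spherical equivalent refractions (D), for illustration only
subjective = [-2.00, -1.25, 0.50, 1.75, -3.50, 0.25, -0.75, 2.25]
auto = [-1.90, -1.30, 0.45, 1.80, -3.60, 0.30, -0.70, 2.20]
auto_doubled = [2 * v for v in auto]                   # introduce a scaling error

r_before = pearson(auto, subjective)
r_after = pearson(auto_doubled, subjective)            # unchanged: r is scale-free
s_before = sd_of_differences(auto, subjective)
s_after = sd_of_differences(auto_doubled, subjective)  # far larger: LoA widen
```

Because the Pearson coefficient rescales each variable by its own standard deviation, any linear rescaling of one method is invisible to it; the limits of agreement, which work in the units of measurement, expose the error immediately.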

## Precision (repeatability and reproducibility)

In recent years, the limits of agreement method has also been used in the assessment of precision (repeatability and reproducibility).^{22–28} The precision of a device or clinical method is an important factor which should be considered when conducting method comparison studies. If a device has poor precision it is unlikely to have good agreement with another device, so comparing the precision of the two devices will provide greater insight into the source of any differences found. Repeatability and reproducibility are the two faces of precision. Repeatability refers to the variability in repeated measurements by one observer when all other factors are assumed constant. Reproducibility refers to the variability in repeated measurements when one or more factors, such as observer, instrument, calibration, environment or time, is varied. The current guidelines from the British and International Standards recommend expressing repeatability and reproducibility estimates as standard deviations. For example, to assess the repeatability of two repeated measurements on a number of subjects, a one-way analysis of variance (ANOVA) should be performed to determine the within-subject standard deviation (*S*_{w}). The *S*_{w} is the repeatability of the measurements. Considering the 95% confidence interval around this value, the repeatability limit (*r*) is reported as 1.96√2 × *S*_{w}, which gives the likely limits within which 95% of differences between repeated measurements should fall. This is a simplification of reporting compared to the limits of agreement because it is assumed that there is no mean difference between repeat measures. Reproducibility and reproducibility limits are calculated and reported using the same method. These limits are identical to the previously described and commonly used repeatability or reproducibility coefficient.^{19} This approach is also mathematically related to the limits of agreement in that √2 × *S*_{w} is equivalent to *s*, the standard deviation of the differences, so 1.96√2 × *S*_{w} corresponds to the half-width of the limits of agreement.

The ANOVA approach has the advantage of simplifying calculations when dealing with more than two sets of replicate data. However, the most important reason for using this approach is to standardise the presentation of results: repeatability should be reported in terms of repeatability (*S*_{w}) and repeatability limits, and likewise for reproducibility, where, in addition, the factors allowed to vary must be specified. Repeatability limits need to be interpreted with caution when used with coarse measurements such as visual acuity measured per line^{29} and contrast sensitivity with large step sizes.^{30,31} In such cases it may be useful to consider the 95% limits for change, which would be the next largest step above the 95% interval. For example, if the repeatability limit of a contrast sensitivity chart was ±0.23 log CS, but the chart used step sizes of 0.10 log CS, the criterion for change would be ±0.30 log CS or ±3 steps.
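For the common case of two repeated measurements per subject, the within-subject standard deviation from a one-way ANOVA reduces to *S*_{w} = √(Σ*d*_{i}^{2}/2*n*), where *d*_{i} is each subject's test–retest difference. A minimal sketch of this calculation, using hypothetical intraocular pressure readings invented for illustration:

```python
from math import sqrt

def repeatability(pairs):
    """Within-subject SD (S_w) and repeatability limit for paired repeats.

    pairs: list of (first, second) measurements, one tuple per subject.
    For two replicates, S_w**2 is the mean of d**2 / 2 across subjects,
    which equals the within-subject mean square of a one-way ANOVA.
    """
    n = len(pairs)
    sw = sqrt(sum((a - b) ** 2 for a, b in pairs) / (2 * n))
    r_limit = 1.96 * sqrt(2) * sw   # repeatability limit, ~2.77 * S_w
    return sw, r_limit

# Hypothetical intraocular pressure readings (mmHg), two per subject
iop_pairs = [(14.0, 14.5), (16.0, 15.5), (12.5, 13.0), (18.0, 18.0), (15.0, 14.0)]
sw, r_limit = repeatability(iop_pairs)
```

With more than two replicates per subject, *S*_{w} should instead be taken as the square root of the within-subject mean square from the full one-way ANOVA.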

## Relationship between the average and the difference in agreement studies

Occasionally there may be a relationship between the average and the difference, such as an increase in the scatter of points with an increase in the average value. This would be evident from the Bland–Altman plot or simple relationship testing. Once detected, the relationship should be removed via data transformation to enable a more accurate assessment of the limits of agreement across the measurement range. Most commonly this will be a log transformation. In cases where a log transformation does not remove the relationship between the average and the difference, the limits of agreement will tend to be too far apart rather than too close, and are therefore unlikely to lead to the acceptance of poor methods of measurement. However, regression approaches have been proposed in cases where logarithmic transformations prove unsatisfactory.^{19,32,33}

## Sample size for agreement studies

In order to determine the sample size required for an agreement study, one must decide how accurately the limits of agreement are to be estimated in terms of confidence intervals. The following approximate expression for the 95% confidence interval of each limit of agreement may be used, where *n* is the sample size:

±1.96√(3*s*^{2}/*n*)

Therefore, if 10 subjects are used, this is approximately 1.07*s*, whereas if 100 subjects are used it equates to 0.34*s*, the latter estimating the limits of agreement much more accurately. For example, if we consider the comparison of the two aberrometers with 10 subjects, the confidence intervals are estimated to be 0.17 μm, whereas with 100 subjects they would be estimated to be 0.05 μm. We saw earlier that the confidence interval for the lower limit of agreement between the two aberrometers was −0.47 to −0.07 μm with a sample size of 10. This is a large range, and an increase in sample size would provide a more accurate assessment of the limits of agreement and tighter confidence intervals. The main problem with wide confidence intervals is the potential to find acceptable agreement between two methods when a larger sample would prove otherwise. A larger sample is also likely to expand the range of measurements over which agreement is tested, which increases the validity of the result. Hence, our recommendation is that agreement studies should include at least 100 subjects. *Table 4* contains a series of examples of appropriate sample sizes for method comparison studies in optometry and ophthalmology.

| Metric | Techniques compared | Study used to calculate sample size | Desired CI of LoA | S.D. of the differences (s) | Sample size required (from 1.96√(3s²/n) = desired CI) |
|---|---|---|---|---|---|
| Central corneal thickness | Low-coherence reflectometry (Lenstar biometer) and Scheimpflug photography (Pentacam) | Huang et al. 2011^{45} | 1.5 μm | 6.12 μm | 98 |
| logMAR visual acuity | ETDRS chart and reduced logMAR E (RLME) chart | Bourne et al. 2003^{46} | 0.005 logMAR | 0.02 logMAR | 95 |
| Spherical equivalent refraction | Subjective refraction and autorefraction | Pesudovs and Weisinger 2004^{20} | 0.08 dioptres | 0.36 dioptres | 120 |
| Intraocular pressure | Goldmann tonometry and dynamic contour tonometry | Ceruti et al. 2009^{47} | 0.50 mmHg | 2.04 mmHg | 98 |
| Anterior chamber depth | Optical coherence tomography (Visante) and ultrasound biomicroscopy | Zhang et al. 2010^{48} | 0.02 mm | 0.09 mm | 120 |
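Using the approximation that each limit of agreement has standard error √(3*s*^{2}/*n*), the expected half-width of its 95% confidence interval is 1.96√(3/*n*) × *s*, which reproduces the 1.07*s* (*n* = 10) and 0.34*s* (*n* = 100) figures quoted above:

```python
from math import sqrt

def loa_ci_half_width_factor(n):
    """Approximate 95% CI half-width of a limit of agreement, in units of s."""
    return 1.96 * sqrt(3 / n)

factor_10 = loa_ci_half_width_factor(10)    # ≈ 1.07 (times s)
factor_100 = loa_ci_half_width_factor(100)  # ≈ 0.34 (times s)
```

In practice one chooses the largest acceptable half-width for the clinical question, then increases *n* until the factor times the expected *s* falls below it.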

## One eye or two?

The statistical issues related to paired eye data have long been known, but remain widely abused.^{13,23,34–36} Data from paired eyes are likely to be correlated (except in asymmetric disease, e.g. keratoconus) and so confound correlation and regression analyses. This problem is commonly avoided through the use of data from one eye of each individual, or it can be controlled for in regression models or estimating equations via a number of different approaches.^{37} While this is an important confounder for the calculation of an ICC, paired eye data do not confound limits of agreement calculations. However, if the data from the two eyes are very similar, using the second eye adds little information. Therefore, in a population of normal eyes, using paired data simply doubles the number of data points yet may provide no more information than one eye per person. Moreover, paired data may artificially reduce the confidence interval around the limits of agreement if the number of eyes, rather than the number of participants, is used in the calculation. For limits of agreement calculations, controlling for paired data is not possible, so the decision is whether to include one eye or two. We recommend that in studies where data from both eyes are highly correlated, especially normal eyes, only one eye per participant be used. However, in populations with asymmetric eye disease, the use of data from both eyes is wholly appropriate.

## Summary

In the comparison of two instruments or clinical tests, the limits of agreement approach should be applied and correlation coefficients including ICC should be avoided. This review provides a step-by-step guide for researchers and authors conducting agreement studies in optometry and ophthalmology.

## Acknowledgements

No author has a commercial or financial interest in any product mentioned.


## Appendices

### Colm McAlinden

Colm McAlinden undertook his undergraduate degree at Cardiff University followed by training in Moorfields Eye Hospital London. He subsequently completed a Masters and a PhD at the University of Ulster, followed by a post-doctoral fellowship at Flinders University (South Australia). His research interests are primarily in the field of refractive surgery, ophthalmology outcomes research, and statistics.

### Jyoti Khadka

Jyoti Khadka did his undergraduate optometry degree at Tribhuvan University (Nepal). He was awarded with a PhD at Cardiff University (UK) in 2010. Currently, he is working as a full-time research associate at Flinders University (South Australia). His research interests are ophthalmology outcome studies, item banking, computer adaptive testing and low vision rehabilitation outcome studies. He is currently involved in developing a vision-specific item bank in ophthalmology with Prof. Konrad Pesudovs.

### Konrad Pesudovs

Konrad Pesudovs completed a PhD at Flinders University (South Australia) Department of Ophthalmology in 2000. He undertook post-doctoral training in the UK and the USA before returning to Flinders University in 2004, where, in 2009 he was appointed Foundation Chair of Optometry and Vision Science. His research interest is ophthalmology outcomes research; incorporating optical, visual and patient-reported measurement into holistic outcome assessment. A key element being the development of patient-reported measures using Rasch analysis. Konrad has completed numerous agreement and precision (repeatability and reproducibility) studies into ophthalmic outcome measures.