## Introduction

Many research papers in imaging concern measurement, a topic that has been much neglected in the medical research methods literature. In this paper we discuss the estimation of agreement between two methods of measurement, and the estimation of agreement between two measurements made by the same method, also called repeatability. In both cases we are concerned with interpreting the individual clinical measurement. For agreement between two different methods of measurement, we ask whether we can use measurements by these two methods interchangeably, i.e. whether the method by which the measurement was made can be ignored. For two measurements made by the same method, we ask how variable measurements from a patient can be when the true value of the quantity does not change, and what a single measurement tells us about the patient's true or average value. In some studies repeated observations are made by the same observer, or by many different observers, and are treated as repeated observations of the same thing. In others a small number of observers, often two, are used and systematic differences between them are explored, in which case the analysis is like that for comparing two different methods of measurement.

We avoid all mathematics, except for one formula near the end. Instead we show what happens when some simple statistical methods are applied to a set of randomly generated data, and then show how this helps the interpretation of these methods when they are used to tackle measurement problems. We illustrate these methods by examples drawn from the imaging literature. For some of these examples, rather than bother the original authors for their data, we have digitized them approximately from the published graphs, and our figures differ slightly but not in any important way from those originally published.
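As a concrete illustration of this simulation approach, artificial data with a known structure can be generated in a few lines. The sketch below is ours, not from any study discussed here: the sample size, means, and error standard deviations are arbitrary choices, and each "patient" has a true value measured twice with independent random error, so we know in advance that the two sets of measurements agree apart from that error.

```python
import random

random.seed(42)  # fixed seed so the simulation is reproducible

# Simulate 50 "patients": each has a true value, and each of two
# measurements adds independent random error to that true value.
# (All numbers are illustrative assumptions, not published data.)
true_values = [random.gauss(1.0, 0.2) for _ in range(50)]
method_a = [t + random.gauss(0, 0.05) for t in true_values]
method_b = [t + random.gauss(0, 0.05) for t in true_values]
```

Because the generating process is known, any statistical method applied to `method_a` and `method_b` can be checked against the interpretation we know to be correct.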

We shall start with a typical example of a measurement study. Borg *et al.*^{1} compared single X-ray absorptiometry (SXA) with single photon absorptiometry (SPA). They produced a scatter plot for arm bone mineral density similar to that in Figure 1. This looks like good agreement, with a tight cloud of points and a high value for the correlation coefficient, *r* = 0.98. The points cluster quite closely around the line drawn through them, the regression line. But should this make us think we could use bone mineral densities measured by SXA and SPA interchangeably? In Figure 2 we have added the line of equality, the line on which points would lie if the two measurements were the same. Nearly all the points lie to the left of the line of equality. There is a clear bias: the SXA measurements tend to exceed the SPA measurements by 0.02 g/cm^{2}. We shall now explain why the correlation coefficient does not reflect this bias and go on to explore the interpretation of the regression line. To do this we show what happens when these methods are applied to artificially generated data, i.e. when we know what the interpretation should be. We then describe a simple alternative approach, limits of agreement, which avoids such problems.