### Abstract

The analysis of concordance among repeated measures has received a great deal of attention in the statistical literature, leading to a range of different approaches. However, because all the approaches assess the closeness among the readings taken on the same subject, the conclusions about the degree of concordance should be similar regardless of the approach applied. Here, two indices for assessing the concordance among continuous repeated measures, the intraclass correlation coefficient and the total deviation index, are applied and compared in two case examples. The first example concerns the repeatability of individual nutrient allocation strategy assessed by stable isotope analysis. The second example deals with the assessment of the concordance of functional magnetic resonance imaging data that show spatial correlation. The results differ depending on the approach applied, leading to contradictory conclusions about the degree of concordance. The reason behind these results is discussed, reaching the conclusion that the total deviation index assesses only the agreement among repeated measurements, whereas the intraclass correlation coefficient assesses the distinguishability among subjects, a concept that involves both the agreement among repeated measurements and the spread of subjects. Therefore, the best way to select the right approach is to understand the right question behind the research hypothesis. Copyright © 2013 John Wiley & Sons, Ltd.

### Introduction

The aim of an agreement assay is to assess the degree of concordance among repeated readings taken on a sample of subjects. Each reading may be measured under the same conditions, or the conditions may change across readings. For example, the readings may be taken either by the same observer (same conditions) or by a group of different observers (different conditions). As a consequence, the agreement assay appears under different names depending on which conditions change across readings. Repeatability and reliability refer to the agreement among repeated measures taken under the same measuring conditions, whereas concordance usually refers to the agreement of readings taken by different observers.

However, the methodology to assess agreement usually depends on the type of data, qualitative or quantitative. Kappa indices [1, 2] have been widely applied in the qualitative case, whereas a wide range of approaches is available for the quantitative case [3, 4].

Barnhart *et al.* [4] proposed classifying these approaches as unscaled and scaled indices. Unscaled agreement indices summarize the agreement on the basis of the absolute difference of readings, so that they are expressed in the same units as the outcome analyzed. Among these approaches we find the total deviation index (TDI) [5, 6] and the limits of agreement [7]. Scaled agreement indices, on the other hand, are standardized and dimensionless. Some examples are the intraclass correlation coefficient (ICC) [8, 9], the concordance correlation coefficient [10, 11], the coefficient of individual agreement [12-14], and the within-subject coefficient of variation [15].

Furthermore, from a mathematical point of view, there are two basic ways to measure the distance between two points: absolute or squared distances. This fundamental difference carries over into the data analysis and into the conclusions drawn about concordance. Thus, the approaches can also be categorized into two families depending on how the distance among measurements is obtained: one family is based on the absolute distance (e.g., TDI), and the other on the squared distance (e.g., ICC).
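The difference between the two distance families can be illustrated with a toy computation, comparing the mean absolute difference (the basis of TDI-type indices) with the root mean squared difference (the basis of ICC-type indices). This is only a sketch; the paired readings below are invented for illustration:

```python
# Paired readings on five subjects (invented values for illustration).
x = [5.1, 7.0, 9.2, 11.1, 6.4]
y = [5.4, 6.5, 9.9, 10.8, 6.1]

diffs = [a - b for a, b in zip(x, y)]

# Absolute-distance summary (the family the TDI belongs to).
mean_abs = sum(abs(d) for d in diffs) / len(diffs)

# Squared-distance summary (the family the ICC belongs to).
rmsd = (sum(d * d for d in diffs) / len(diffs)) ** 0.5

print(mean_abs, rmsd)
```

Because squaring weights large discrepancies more heavily, the squared-distance summary is never smaller than the absolute one, which is one reason the two families can rank the same data differently.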

However, because all the approaches assess the closeness among the readings taken on the same subject, the conclusions about the degree of concordance should be similar regardless of the approach applied. Thus, the choice of method should rely only on the ease of interpretation when applied to any particular data set.

Here, two of these indices (TDI and ICC) are applied and compared in two data sets, and their results and interpretation differ dramatically. Whereas the ICC is defined in terms of both the between- and within-subjects variability, other agreement indices involve only the within-subjects variability; the TDI is one such statistic, which in addition incorporates a predetermined coverage criterion. Thus, comparing the TDI and the ICC in different scenarios unravels what these approaches are really assessing. Furthermore, it highlights that the key issue in selecting the right approach to measuring concordance (as in any data analysis) is to understand the right question behind the research hypothesis.

The manuscript is organized as follows. Section 2 presents two real data examples: the first concerns the repeatability of individual nutrient allocation strategy assessed by stable isotope signatures; the second concerns the assessment of the concordance of two functional magnetic resonance images (fMRI). The TDI and the ICC are defined in Section 3 and applied to the data in Section 4. Finally, the results are discussed in Section 5.

### Discussion

The assessment of agreement or concordance among different readings can be carried out using different approaches [3, 4]. The choice of a particular approach, aside from the nature of the data (qualitative or quantitative), mainly relies on the interpretation and comprehension of the results. Thus, it is expected that the conclusions concerning the assessment of agreement should be the same independently of the approach applied. However, in this work we have seen that the conclusion on the concordance or reliability differed depending on whether the ICC or the TDI was applied.

Usually the degree of agreement is assessed by means of an index that summarizes the amount of concordance shown by the data. All of these indices share the characteristic of including the within-subject variability (WSV). The WSV is the essential component of agreement, because a WSV of 0 implies that all the readings from the same subject are equal. Thus, the indices can be seen as different ways of expressing the WSV.

Furthermore, most of the indices account only for the within-subject differences; examples are the TDI, the within-subject coefficient of variation, and the coefficient of individual agreement. We will call this class of indices *pure agreement indices* (PAI).
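As a sketch of how a pure agreement index operates on within-subject differences alone, the TDI at coverage *p* can be estimated empirically as the *p*-th quantile of the absolute paired differences. This is a simplified nonparametric version (the TDI of Lin is defined under a parametric normal model); the function name and example data are invented for illustration:

```python
import math

def empirical_tdi(x, y, p=0.9):
    """Empirical TDI: the p-quantile of |x_i - y_i|.

    A nonparametric sketch; the parametric TDI assumes normally
    distributed paired differences."""
    abs_d = sorted(abs(a - b) for a, b in zip(x, y))
    # Simple quantile by linear interpolation between order statistics.
    k = p * (len(abs_d) - 1)
    lo, hi = math.floor(k), math.ceil(k)
    return abs_d[lo] + (k - lo) * (abs_d[hi] - abs_d[lo])

# Invented paired readings for illustration.
x = [5.1, 7.0, 9.2, 11.1, 6.4, 8.3]
y = [5.4, 6.5, 9.9, 10.8, 6.1, 8.0]
print(empirical_tdi(x, y, p=0.9))
```

The returned boundary is read as "90% of the absolute differences between paired readings fall within this value", which is why the TDI is expressed in the units of the outcome and involves no between-subjects information.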

On the other hand, the ICC comprises a family of coefficients because the exact expression depends on the variance components included in the associated linear mixed model. Nevertheless, given that the total variance is the sum of the between-subjects and within-subject variances, the general expression of the ICC is the ratio of the between-subjects variance to the total variance:

ICC = *σ*^{2}_{BS} / (*σ*^{2}_{BS} + *σ*^{2}_{WS})

Therefore, the WSV is linked to the total variance by introducing the between-subjects variance (BSV). The ICC works as an indicator of agreement because, in the case of perfect agreement, that is, *σ*^{2}_{WS} = 0, the ICC takes the value 1. What, then, is the benefit of linking the WSV to the BSV? At first glance one would say that there is no benefit: the result is more difficult to interpret than that from a PAI and, more disturbingly, as we saw in the results section, we could arrive at contradictory conclusions.
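For a balanced one-way random-effects design, the variance components in this ratio can be estimated from the classical ANOVA mean squares. The following is a minimal sketch (the function name and the example data are invented for illustration):

```python
def icc_oneway(data):
    """One-way random-effects ICC from per-subject lists of readings.

    Assumes a balanced design: every subject has the same number
    of readings. Estimates sigma2_BS / (sigma2_BS + sigma2_WS)."""
    n = len(data)       # number of subjects
    k = len(data[0])    # readings per subject
    grand = sum(sum(row) for row in data) / (n * k)
    means = [sum(row) / k for row in data]

    # ANOVA mean squares.
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum((x - m) ** 2
              for row, m in zip(data, means) for x in row) / (n * (k - 1))

    # Method-of-moments variance components, truncated at zero.
    sigma2_bs = max((msb - msw) / k, 0.0)
    return sigma2_bs / (sigma2_bs + msw)

# Invented example: 4 subjects, 3 readings each.
data = [[5.0, 5.2, 4.9], [7.1, 6.8, 7.0],
        [9.0, 9.3, 8.8], [11.2, 10.9, 11.0]]
print(icc_oneway(data))
```

With well-separated subject means and small within-subject scatter, the estimate is close to 1; if all readings from a subject coincide (WSV of 0), the ICC is exactly 1.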

However, the reason for the disagreement between the PAI and the ICC is that they measure different things. It is clear that a PAI assesses agreement; so what is the ICC evaluating? Figure 7, where the simulated probability density functions of four subjects are drawn, can shed some light on this question. Three settings of between- and within-subjects variance are considered. The first setting consists of normal distributions with means 5, 7, 9, and 11, expressing the between-subjects variability; the four distributions have a common variance of 3, representing the within-subject variability. In the second setting the subject means remain the same but the within-subject variance is lowered to 1. Finally, in the third setting, the between-subjects variance is decreased by moving the subject means closer together, to values of 5, 6, 7, and 8. These settings show a higher ICC and a lower TDI in Figure 7(b) relative to Figure 7(a), so one could conclude that agreement is higher in (b). However, when comparing Figure 7(b) with Figure 7(c), the ICC is higher in scenario (b) whereas the TDI remains the same. Comparing Figure 7(a) against 7(c), the ICC is similar whereas the TDI is lower in setting (c). Considering that the TDI is a *pure agreement index*, we can argue that the ICC is evaluating a different concept. The ICC includes the WSV, so agreement is an essential component of the ICC; nevertheless, the ICC evaluates a concept that goes beyond agreement. This concept is *distinguishability*, and it is the consequence of setting the WSV against the BSV. Distinguishability can therefore be defined as the ability to identify and discriminate subjects by their own values.

Figure 7(a) shows a scenario with high BSV and WSV. This leads to a situation in which it is difficult to *distinguish* subjects by means of their observed values, as a consequence of the high degree of overlap of their probability density functions. Moving to Figure 7(b), the *spread* of subjects (BSV) is the same but the WSV is lower; thus, the agreement is higher and it is easier to distinguish subjects. In scenario (c) the degree of agreement is the same as in scenario (b), but the lower BSV leads to worse distinguishability, similar to that of scenario (a).
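The pattern described for the three settings of Figure 7 can be reproduced numerically. The sketch below uses the population ICC, *σ*^{2}_{BS} / (*σ*^{2}_{BS} + *σ*^{2}_{WS}), and a normal-theory TDI at 90% coverage for the difference of two exchangeable readings, 1.645·√(2 *σ*^{2}_{WS}); taking the population variance of the four stated subject means as the BSV is our own simplifying assumption:

```python
from math import sqrt

Z95 = 1.6449  # 95th percentile of the standard normal distribution

def icc(s2_bs, s2_ws):
    """Population ICC: between-subjects over total variance."""
    return s2_bs / (s2_bs + s2_ws)

def tdi90(s2_ws):
    """Normal-theory TDI at 90% coverage for the difference of two
    exchangeable readings (zero mean difference assumed)."""
    return Z95 * sqrt(2 * s2_ws)

def pop_var(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# The three settings of Figure 7: (subject means, within-subject variance).
settings = {
    "a: spread means, high WSV": ([5, 7, 9, 11], 3.0),
    "b: spread means, low WSV":  ([5, 7, 9, 11], 1.0),
    "c: close means, low WSV":   ([5, 6, 7, 8], 1.0),
}
for name, (means, s2_ws) in settings.items():
    s2_bs = pop_var(means)
    print(f"{name}: ICC={icc(s2_bs, s2_ws):.3f}, TDI90={tdi90(s2_ws):.3f}")
```

The output reproduces the qualitative pattern in the text: from (a) to (b) the ICC rises and the TDI falls; from (b) to (c) the TDI is unchanged while the ICC drops; and (a) and (c) have similar ICCs despite different TDIs.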

It is relevant to note that the concept of distinguishability has already been applied to qualitative scales [28, 29], in the sense of distinguishability of categories. Thus, *distinguishability* concerns both the agreement of the readings for the same subject and the spread of subjects. An extreme situation would be one where the probability density function was the same for all subjects: a situation of total overlap. The BSV would be 0, leading to an ICC of 0 regardless of the WSV, that is, regardless of the degree of agreement.

The ICC is still a valid index for comparing the degree of agreement among different data sets if their BSVs are similar and there is a reasonable spread of subjects with regard to the outcome analyzed, that is, if the BSV expresses the variability of the outcome in the population. However, when these conditions are not fulfilled, the validity of the ICC as an indicator of agreement is questionable and a *pure agreement index* is preferable.

Concerning the examples analyzed in Section 4, in the case of intra-clutch repeatability the Columbretes location showed lower values of both BSV and WSV. The agreement is therefore higher than at the Ebro Delta location, but it comes with a lower spread of subjects, so the distinguishability is lower at the Columbretes location. However, what was of interest in this study? The researchers were not concerned with whether the agreement of *δ*^{15}*N* was higher or lower at one particular location; they were interested in appraising whether *δ*^{15}*N* was a good marker of individual nutrient allocation strategy. A high ICC would mean that (i) the subjects show different behavior, and (ii) the behavior of each subject is consistent across repeated measures; in short, that the subjects are distinguishable by their *δ*^{15}*N*. Thus, in this study the ICC is the appropriate index, whereas a pure agreement index would not be suitable.

With regard to the fMRI example, the discrepancy between the ICC and the TDI is explained by the variation of the BSV across subjects (voxels) because of the spatial correlation. In this example, the ICC informs about the level of voxel distinguishability in the region. The observed TDI and ICC patterns can therefore be explained by the plausible biological situation in which the central voxels in the SMA are highly activated from session to session with relatively the same magnitude (low TDI), yet are indistinguishable from each other (low ICC). In this case, the researchers were interested in evaluating the degree of agreement between sessions; thus, taken as an agreement index, the ICC may be misleading here and the TDI should be chosen.

Hence, when faced with a discrepancy between a PAI, such as the TDI, and the ICC, the interpretation of agreement should be based on the PAI. Additionally, it should be concluded that the ICC assesses the distinguishability among subjects, which involves at the same time (1) the spread of subjects in relation to the analyzed outcome and (2) the agreement among repeated measurements.