Keywords:

  • concordance;
  • agreement;
  • distinguishability;
  • repeated measures;
  • linear mixed models;
  • repeatability;
  • functional magnetic resonance imaging;
  • intraclass correlation coefficient;
  • total deviation index

Abstract

  1. Top of page
  2. Abstract
  3. Introduction
  4. Examples
  5. Definition of intraclass correlation coefficient and total deviation index
  6. Results
  7. Discussion
  8. Acknowledgement
  9. References

The analysis of concordance among repeated measures has received a great deal of attention in the statistical literature, leading to a range of different approaches. However, because all the approaches assess the closeness among the readings taken on the same subject, the conclusions about the degree of concordance should be similar regardless of the approach applied. Here, two indices for assessing the concordance among continuous repeated measures, the intraclass correlation coefficient and the total deviation index, are applied and compared in two case examples. The first example concerns the repeatability of individual nutrient allocation strategy assessed by stable isotope analysis. The second example deals with the concordance of functional magnetic resonance imaging data that show spatial correlation. The results differ depending on the approach applied, leading to contradictory conclusions about the degree of concordance. The reason behind these results is discussed, reaching the conclusion that the total deviation index assesses only the agreement among repeated measurements, whereas the intraclass correlation coefficient assesses distinguishability among subjects, a concept that involves both the agreement among repeated measurements and the spread of subjects. Therefore, the best way to select the right approach is to understand the question behind the research hypothesis. Copyright © 2013 John Wiley & Sons, Ltd.

Introduction

The aim of an agreement assay is to assess the degree of concordance among repeated readings taken on a sample of subjects. The readings can all be taken under the same conditions or, alternatively, the conditions can change across readings. For example, the readings may be taken either by the same observer (same conditions) or by a group of different observers (different conditions). As a consequence, the agreement assay goes by different names depending on which conditions change across readings. Repeatability and reliability refer to the agreement among repeated measures taken under the same measuring conditions, whereas concordance usually refers to the agreement of readings taken by different observers.

The methodology used to assess agreement also depends on the type of data: qualitative or quantitative. Kappa indices [1, 2] have been widely applied to qualitative data, whereas a wide range of approaches can be found for quantitative data [3, 4].

Barnhart et al. [4] proposed classifying these approaches as unscaled and scaled indices. Unscaled agreement indices summarize the agreement on the basis of the absolute difference of readings, so they are expressed in the same units as the outcome analyzed. Among these approaches are the total deviation index (TDI) [5, 6] and the limits of agreement [7]. Scaled agreement indices, on the other hand, are standardized and therefore dimensionless. Some examples are the intraclass correlation coefficient (ICC) [8, 9], the concordance correlation coefficient [10, 11], the coefficient of individual agreement [12-14], and the within-subject coefficient of variation [15].

Furthermore, from a mathematical point of view, there are two basic ways to measure the distance between two points: absolute or squared distance. This fundamental difference leads to differences in the data analysis and in the conclusions drawn about concordance. Thus, the approaches can also be categorized into two families depending on how the distance among measurements is obtained: one family is based on the absolute distance (e.g., TDI) and the other on the squared distance (e.g., ICC).

However, because all the approaches assess the closeness among the readings taken on the same subject, the conclusions about the degree of concordance should be similar regardless of the approach applied. Thus, the choice of method should rest only on how easily its results can be interpreted for the particular data at hand.

Here, two of these indices (TDI and ICC) are applied to two data sets and compared, and their results and interpretation differ dramatically. Although the ICC is defined on both the between-subjects and the within-subjects variability, other agreement indices involve only the within-subjects variability. The TDI is one such statistic and, in addition, incorporates a predetermined criterion. Thus, comparing the TDI and the ICC in different scenarios reveals what these approaches are really assessing. Furthermore, it will be highlighted that the key issue in selecting the right approach to measuring concordance (as in any data analysis) is to understand the question behind the research hypothesis.

The manuscript is organized as follows. Section 2 contains the data examples of two real applications: the first related to repeatability of individual nutrient allocation strategy by stable isotope signatures; the second consists of the assessment of the concordance of two functional magnetic resonance images (fMRI). TDI and ICC are defined in Section 3 and applied to data in Section 4. Finally, the results are discussed in Section 5.

Examples

Individual nutrient allocation strategy by δ15N isotope signatures

In ecology and evolutionary biology, the repeatability of a specific trait is related to its sensitivity to environmental influences as well as to learning and experience. Thus, it is of great interest to assess the repeatability of morphological, physiological, and behavioral traits. In this case example, researchers were interested in evaluating the repeatability of the origin of the nutrients used during clutch production in the yellow-legged gull (Larus michahellis). The nitrogen stable isotope signature δ15N of the egg albumen was used as a surrogate of the exogenous (i.e., diet) or endogenous (i.e., female reserves) origin of the nutrients allocated to this egg compartment. The yellow-legged gull typically shows a modal clutch size of three eggs, with the last-laid egg being significantly smaller than the first two, so the albumen δ15N values of the first two eggs were used to assess the degree of repeatability. The albumen isotopic signal is assumed to reflect that of the nutrients used during the two-day period of albumen formation; therefore, the isotopic values of the first and second eggs can be considered repeated measures from two consecutive, non-overlapping short periods. Data were collected at two locations in the northwestern Mediterranean area in different breeding episodes. Specifically, 18 female yellow-legged gulls were analyzed at the Columbretes Islands in 2004 and 43 at the Ebro Delta in 2008.

Functional magnetic resonance imaging data

Functional magnetic resonance imaging (fMRI) data were obtained from one adult who underwent two scanning sessions separated by 3 months. The subject performed a 2-min working-memory task that required monitoring a series of auditory stimuli (a pseudorandom sequence of numbers) and identifying targets (the number 8) while viewing a fixation cross in the center of a projected computer screen. Up to 36 3D images of the brain were obtained and processed.

The sensomotory motor area (SMA) was highly activated during the performance of the task. Thus, a volume of interest centered on the SMA, with a radius of 10 cm, was extracted. It consisted of an 11 × 11 × 11 volume of highly activated voxels. The activity at each voxel was summarized in a t-value that resulted from fitting the experimental model to the hemodynamic response (see [16] for further details). The aim of the study was to assess the concordance of the t-values between the two sessions in specific regions of high brain activity.

Definition of intraclass correlation coefficient and total deviation index

This section briefly describes the ICC and the TDI. As will be seen, the exact expression of these indices depends on the model assumed for the data. The description here is therefore generic, and concrete expressions are derived when the indices are applied to the example data.

Intraclass correlation coefficient

Let us assume that a continuous variable is obtained in a repeated measures design in which n subjects are each measured k times. Under this design, the variability of the data can be split into between-subjects and within-subjects variability. Between-subjects variability expresses the mean differences among subjects, whereas within-subjects variability refers to the differences among the readings from the same subject.

The simplest design is that in which the repeated measures are considered replicates, that is, the measuring conditions remain the same and the readings from the same subject are therefore interchangeable. Let $Y_{ij}$ be the j-th reading obtained from the i-th subject and assume the following linear model:

$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij},$$

where $\mu$ is the overall mean, $\alpha_i$ is the subject random effect, assumed to follow a Normal distribution with mean 0 and variance $\sigma_\alpha^2$, and $\varepsilon_{ij}$ is the random error, also assumed to follow a Normal distribution with mean 0 and variance $\sigma_\varepsilon^2$. It is also assumed that the random effects are independent of any other term of the model.

Furthermore, the model can be made more complex by introducing additional sources of variability. For example, when assessing the concordance between observers (or methods), a new effect must be introduced in the model to account for the bias among observers. Hence, $Y_{ijk}$ now stands for the k-th reading taken on the i-th subject by the j-th observer. The associated linear model becomes

$$Y_{ijk} = \mu + \alpha_i + \beta_j + \varepsilon_{ijk},$$

where $\beta_j$ stands for the observer effect, which can be either a fixed or a random effect.

Further modifications of the model include introducing a subject-by-observer interaction, allowing heterogeneity of the random error variance, or accounting for the dependence of the random errors in a longitudinal design. Thus, the flexibility of linear mixed models can be used to accommodate the particularities of the readings. For instance, the design of the study may be more complex, including more than one reading by subject and observer to separate the interaction effect from the random error. In addition, the presence of replicates leads to more accurate estimates of the random error variance.

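The ICC is defined as the correlation between any pair of data from the same subject (class). Thus, if $Y_{ijk}$ and $Y_{ij'k'}$ are a pair of data from the i-th subject, the general expression of the ICC is as follows:

$$\mathrm{ICC} = \mathrm{Corr}\left(Y_{ijk}, Y_{ij'k'}\right) = \frac{\mathrm{Cov}\left(Y_{ijk}, Y_{ij'k'}\right)}{\sqrt{\mathrm{Var}\left(Y_{ijk}\right)\,\mathrm{Var}\left(Y_{ij'k'}\right)}}$$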

The exact expression of the ICC in terms of the variance components is going to depend on the linear mixed model assumed [17].
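For the simplest replicate model described above, the dependence of the ICC on the variance components can be illustrated with a small simulation. The sketch below is only an illustration: it uses the ANOVA (method-of-moments) estimators rather than REML (for balanced data the two coincide when the between-subjects estimate is non-negative), and the function names `simulate_replicates` and `anova_icc` are hypothetical.

```python
import random
from statistics import mean


def simulate_replicates(n, k, mu, var_alpha, var_eps, seed=0):
    """Draw n subjects x k replicates from Y_ij = mu + alpha_i + eps_ij."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        alpha = rng.gauss(0.0, var_alpha ** 0.5)  # subject random effect
        data.append([mu + alpha + rng.gauss(0.0, var_eps ** 0.5)
                     for _ in range(k)])
    return data


def anova_icc(data):
    """ANOVA estimates (var_alpha, var_eps, ICC) for a balanced layout."""
    n, k = len(data), len(data[0])
    grand = mean(y for row in data for y in row)
    row_means = [mean(row) for row in data]
    # Between-subjects and within-subjects mean squares.
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((y - m) ** 2
              for row, m in zip(data, row_means)
              for y in row) / (n * (k - 1))
    var_alpha = max((msb - msw) / k, 0.0)  # truncate negative estimates at 0
    return var_alpha, msw, var_alpha / (var_alpha + msw)


data = simulate_replicates(n=2000, k=2, mu=10.0, var_alpha=1.0, var_eps=1.0)
var_alpha_hat, var_eps_hat, icc_hat = anova_icc(data)
print(f"ICC estimate: {icc_hat:.3f} (true value 0.5)")
```

With equal between-subjects and within-subjects variances the true ICC is 0.5, so the estimate should fall close to that value for a sample of this size.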

Inferential aspects depend on the estimation method used. If the Bayesian paradigm is applied [18], credible intervals for the ICC are obtained from its posterior distribution. In the frequentist setting, the approach most commonly applied to estimate the parameters of the linear mixed model is restricted maximum likelihood (REML). In that case, the ICC is estimated from the variance component estimates, and asymptotic normality is assumed for inference purposes after applying the inverse hyperbolic tangent transformation, or Fisher's Z transformation,

$$Z = \tanh^{-1}(\mathrm{ICC}) = \frac{1}{2}\ln\left(\frac{1 + \mathrm{ICC}}{1 - \mathrm{ICC}}\right),$$

which has been shown to be better approximated by the Normal distribution [19].
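As an illustration of how the transformation is used, the following sketch builds a confidence interval on the Z scale and back-transforms it. It assumes that a standard error of the ICC estimate (`se_icc`, a hypothetical input) is already available and carries it to the Z scale with the delta method; the source does not state how the standard error was obtained, so this is only one plausible route.

```python
import math
from statistics import NormalDist


def icc_interval(icc, se_icc, level=0.95):
    """Back-transformed Fisher Z interval for an ICC estimate.

    Assumption: se_icc is the standard error of the ICC estimate; the
    delta method gives se(Z) = se_icc / (1 - icc**2).
    """
    z = math.atanh(icc)                    # Fisher's Z transformation
    se_z = se_icc / (1.0 - icc ** 2)       # delta-method standard error
    q = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - q * se_z), math.tanh(z + q * se_z)


lower, upper = icc_interval(icc=0.5, se_icc=0.2)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

Because the interval is symmetric on the Z scale and then mapped back through the hyperbolic tangent, its limits always stay inside (−1, 1) and are generally asymmetric around the point estimate.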

Total deviation index

Given a pair of measures from the same subject, $Y_{ij}$ and $Y_{ij'}$, the TDI is defined as the boundary $\kappa_p$ such that a large proportion p of the paired measurement differences, $D = Y_{ij} - Y_{ij'}$, fall within the boundary [5]:

$$P\left(|D| < \kappa_p\right) = p$$

Assuming that the distribution of D is Normal with mean $\mu_D$ and variance $\sigma_D^2$, the TDI is defined as [3]

$$\kappa_p = \sigma_D \sqrt{\chi^{2\,(-1)}_{1,\lambda}(p)}, \qquad \lambda = \frac{\mu_D^2}{\sigma_D^2},$$

where $\chi^{2\,(-1)}_{1,\lambda}$ stands for the inverse of the cumulative distribution function of the chi-square distribution with 1 degree of freedom and noncentrality parameter $\lambda = \mu_D^2 / \sigma_D^2$. The expressions of $\mu_D$ and $\sigma_D^2$ depend on the model assumed for the data [5].
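The definition can be made concrete with a short numerical sketch. Rather than inverting the noncentral chi-square distribution, the code below solves the equivalent condition P(|D| < κ) = p directly by bisection on the Normal distribution function; this substitution, and the function names `coverage` and `tdi`, are choices made here for illustration and are not taken from the cited references.

```python
from statistics import NormalDist


def coverage(kappa, mu_d, sigma_d):
    """P(|D| < kappa) when D is Normal(mu_d, sigma_d**2)."""
    dist = NormalDist(mu_d, sigma_d)
    return dist.cdf(kappa) - dist.cdf(-kappa)


def tdi(p, mu_d, sigma_d, tol=1e-10):
    """Smallest kappa whose coverage reaches p, found by bisection."""
    lo, hi = 0.0, abs(mu_d) + 10.0 * sigma_d  # coverage at hi is ~1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if coverage(mid, mu_d, sigma_d) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# With no systematic difference the TDI is a Normal quantile times sigma_d.
print(f"{tdi(0.90, mu_d=0.0, sigma_d=1.0):.4f}")
# A nonzero mean difference widens the boundary for the same sigma_d.
print(f"{tdi(0.90, mu_d=0.5, sigma_d=1.0):.4f}")
```

When the mean difference is zero, the result reduces to the (1 + p)/2 quantile of the standard Normal distribution multiplied by the standard deviation of D, which is a convenient check on the computation.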

Inference usually consists of computing an upper bound of $\kappa_p$ with a 100(1 − α)% confidence level, usually denoted as $\mathrm{UB}_{(1-\alpha)\%}(\kappa_p)$.

The main difficulty in estimating the upper bound of the TDI stems from the fact that there is no closed-form expression for it. However, several approaches have been proposed that successfully estimate the TDI's upper bound [5, 6, 20, 21].

Results

Individual nutrient allocation strategy by δ15N isotope signatures

The aim is to assess the degree of repeatability of δ15N between the first and second egg of a clutch at each location. A repeatability that changes across locations may be interpreted as a difference in nutrient allocation strategy.

The mean and SD of each egg and of their difference can be found in Table 1. Additionally, Figure 1 shows the observation pairs at the two locations, and Figure 2 shows the quantile–quantile plot of the difference between first and second egg data for each location. The plots indicate that it is reasonable to assume normality of the difference at both locations.

Table 1. Mean and SD of δ15N of albumen corresponding to first and second eggs and their difference.

                 Columbretes Is. 2004    Ebro Delta 2008
                 Mean      SD            Mean      SD
Egg 1            11.03     0.40          10.65     0.76
Egg 2            11.22     0.38          10.91     0.84
Egg 1 − Egg 2    −0.19     0.40          −0.26     0.60

Figure 1. Scatter plot of δ15N data.

Figure 2. δ15N data. Q-Q plot of the difference between first and second egg.

The following linear mixed model is fitted to the data from each location:

$$Y_{ij} = \mu + \alpha_i + e_{ij},$$

where $Y_{ij}$ stands for the nitrogen signature of egg j of female i, with j = 1, 2. The female effect, $\alpha_i$, and the random error, $e_{ij}$, are assumed to be distributed as $N(0, \sigma_\alpha^2)$ and $N(0, \sigma_e^2)$, respectively.

The estimates of the ICC and TDI are obtained from the variance component estimates, $\hat{\sigma}_\alpha^2$ and $\hat{\sigma}_e^2$. The upper bound of the TDI is computed using the tolerance interval approach [21]. The results are shown in Table 2. A greater ICC point estimate is obtained at the Ebro Delta location (0.678) than at Columbretes (0.448), so it could be deduced that the albumen nitrogen isotopic signature has a greater repeatability at the Ebro Delta.

Table 2. Intra-clutch repeatability results. The table shows the variance component estimates, the intraclass correlation coefficient (ICC) with the lower (LL) and upper (UL) limits of its 95% confidence interval, and the 95% upper bound of the 90% TDI.

Location       $\hat{\sigma}_\alpha^2$    $\hat{\sigma}_e^2$    ICC      LL 95%    UL 95%    UB95%(κ0.9)
Columbretes    0.080                      0.099                 0.448    −0.032    0.760     1.04
Ebro Delta     0.444                      0.211                 0.678    0.482     0.810     1.33

However, if the 95% upper bound of the TDI is used, the lower value is that of Columbretes (1.04 versus 1.33), which means a greater agreement of the albumen nitrogen values at that location.

Thus, the results are contradictory depending on the index used. One could argue that the 95% confidence intervals of the ICC overlap, so it is not possible to declare which location has the greater repeatability by means of the ICC. However, as will be further addressed in the Discussion section, the reason for this discrepancy goes beyond sampling variability.
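The opposite rankings can be retraced from the variance components reported in Table 2. The sketch below recomputes the ICC as the ratio of the between-female variance to the total variance, and a TDI point estimate under the assumption (made here, not stated in the source) that the difference between the two eggs of a female has mean 0 and variance equal to twice the error variance. The upper bounds in Table 2 come from the tolerance interval approach [21] and are not reproduced.

```python
from statistics import NormalDist

# Variance component estimates copied from Table 2.
table2 = {
    "Columbretes": {"var_alpha": 0.080, "var_e": 0.099},
    "Ebro Delta": {"var_alpha": 0.444, "var_e": 0.211},
}


def icc(var_alpha, var_e):
    """Between-female variance over total variance."""
    return var_alpha / (var_alpha + var_e)


def tdi_point(var_e, p=0.90):
    """TDI point estimate.

    Assumption: D = e_i1 - e_i2, so mu_D = 0 and sigma_D^2 = 2 * var_e.
    """
    sigma_d = (2.0 * var_e) ** 0.5
    return NormalDist().inv_cdf((1.0 + p) / 2.0) * sigma_d


summary = {loc: (icc(**vc), tdi_point(vc["var_e"]))
           for loc, vc in table2.items()}
for loc, (r, kappa) in summary.items():
    print(f"{loc}: ICC = {r:.3f}, TDI(0.90) point estimate = {kappa:.2f}")
```

The recomputed ICCs agree with the tabulated values up to rounding of the variance components. Under the stated assumption the TDI depends only on the error variance, which is smaller at Columbretes, whereas the ICC also rewards the much larger between-female variance at the Ebro Delta.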

Functional magnetic resonance imaging data

Modelling

As introduced in Section 2, the goal of this analysis was to establish the degree of concordance of brain activity between the two sessions. To this end, after processing of the fMRI data, the brain activity was summarized as a t-value measured at each voxel.

The mean and SD of each session and of their difference are shown in Table 3, whereas Figure 3 shows the scatter plot of the paired data. Figure 4 shows the quantile–quantile plot of the difference between sessions. In this case it also appears reasonable to assume that the difference is normally distributed.

Table 3. Mean and SD of fMRI data corresponding to first and second sessions and their difference.

                         Mean     SD
Session 1                0.68     1.64
Session 2                1.40     1.83
Session 1 − Session 2    −0.72    1.56

Figure 3. Scatter plot of functional magnetic resonance imaging data.

Figure 4. Functional magnetic resonance imaging data. Q-Q plot of the difference between first and second session.

A first model considered to fit the data is as follows:

$$Y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij},$$

where $Y_{ij}$ is the t-value for voxel i and session j, with i = 1, …, 121 and j = 1, 2; $\mu$ is the overall mean; $\alpha_i$ stands for the voxel effect; $\beta_j$ accounts for the constant bias between the two sessions; and $\varepsilon_{ij}$ is the random error.

At first we could assume that $\alpha_i \sim N(0, \sigma_\alpha^2)$ and $\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$; however, this kind of data usually presents spatial dependence among voxels, so that nearer voxels tend to show more similar values than voxels farther apart. To test this hypothesis, the spatial test based on Moran's I [22] was applied to the residuals of the former linear model. The estimate of Moran's I indicated a high correlation among voxels (I = 90.13), and the hypothesis of spatial independence was rejected (p < 0.0001).
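To make the spatial test more tangible, the following sketch computes the classical Moran's I coefficient on a small two-dimensional grid. The source does not report which spatial weights were used or how the statistic was standardized, so binary weights between voxels sharing a side are assumed here, and the helper names `rook_neighbours` and `morans_i` are hypothetical.

```python
def rook_neighbours(rows, cols):
    """Map each grid cell to the cells sharing a side with it (assumed weights)."""
    neighbours = {}
    for r in range(rows):
        for c in range(cols):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            neighbours[(r, c)] = [(a, b) for a, b in candidates
                                  if 0 <= a < rows and 0 <= b < cols]
    return neighbours


def morans_i(values, neighbours):
    """Moran's I with binary weights: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2."""
    n = len(values)
    centre = sum(values.values()) / n
    z = {cell: v - centre for cell, v in values.items()}
    s0 = sum(len(adjacent) for adjacent in neighbours.values())  # sum of weights
    cross = sum(z[i] * z[j]
                for i, adjacent in neighbours.items()
                for j in adjacent)
    return (n / s0) * cross / sum(d * d for d in z.values())


grid = rook_neighbours(4, 4)
# A smooth gradient: neighbouring cells carry similar values.
gradient = {(r, c): float(r) for r in range(4) for c in range(4)}
# A checkerboard: every neighbour carries the opposite value.
checker = {(r, c): 1.0 if (r + c) % 2 == 0 else -1.0
           for r in range(4) for c in range(4)}
print(morans_i(gradient, grid), morans_i(checker, grid))
```

The gradient yields a positive coefficient and the checkerboard a negative one, which is the contrast the test exploits: residuals that resemble their neighbours point to spatial dependence that an independent voxel effect cannot capture.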

The model is therefore modified to accommodate the spatial dependence among voxels by changing the probability distribution of the voxel effect. One model that accounts for spatial dependence is the conditional autoregressive (CAR) model [23, 24]. The linear mixed model has the same effects, but the probability distribution assumed for the vector of voxel effects is as follows:

$$\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)^\top \sim N_n\left(\mathbf{0}, \sigma_\alpha^2 A\right),$$

where $A = Q^{-1}$ denotes the inverse of the n × n matrix Q that encodes the adjacency of the voxels. The elements of Q are defined as follows:

  1. $Q_{ii} = n_i$, where $n_i$ is the number of voxels adjacent to voxel i.

  2. $Q_{ij} = -1$ if voxels i and j are adjacent.

  3. $Q_{ij} = 0$ if voxels i and j (i ≠ j) are not adjacent.
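The three rules above can be sketched in code for a small grid. The example is two-dimensional for brevity and assumes that two voxels are adjacent when they share a side; the source does not spell out the adjacency rule, and the function name `car_q_matrix` is hypothetical.

```python
def car_q_matrix(rows, cols):
    """Build Q for a rows x cols grid, cells indexed row by row.

    Assumption: two cells are adjacent when they share a side.
    """
    index = {(r, c): r * cols + c for r in range(rows) for c in range(cols)}
    n = rows * cols
    q = [[0] * n for _ in range(n)]          # rule 3: zero unless adjacent
    for (r, c), i in index.items():
        for cell in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if cell in index:
                q[i][index[cell]] = -1       # rule 2: adjacent voxels
                q[i][i] += 1                 # rule 1: count of neighbours
    return q


q = car_q_matrix(3, 3)
for row in q:
    print(row)
```

In this 3 × 3 example the corner cells have two neighbours, the side cells three, and the central cell four, so the diagonal of Q records how well connected each voxel is. By construction every row sums to zero.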

Model parameters were estimated in a fully Bayesian framework using the Markov chain Monte Carlo approach [25]. Gibbs sampling was used to obtain the posterior distributions, and two chains were run from dispersed starting values. Noninformative prior distributions were specified for the parameters. Convergence was assessed by the Geweke [26] and Gelman–Rubin [18] statistics. Several models were considered and compared using the deviance information criterion [27]. The best model was that with a CAR voxel effect and a CAR voxel–session interaction effect. Thus, the model fitted is as follows:

$$Y_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ij},$$

where $\boldsymbol{\alpha} \sim N_n(\mathbf{0}, \sigma_\alpha^2 A)$, $\boldsymbol{\gamma}_j = (\gamma_{1j}, \ldots, \gamma_{nj})^\top \sim N_n(\mathbf{0}, \sigma_\gamma^2 A)$, and $\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$.

Note that it would not be possible to separate the voxel–session interaction from the random error in the frequentist paradigm. In the fully Bayesian framework, however, both variance components can be separated and estimated. The voxel–session interaction also enters as a CAR random effect, meaning that the spatial voxel effect changes between sessions.

This model implies that

  1. Brain activity is different at each voxel.

  2. Nearer voxels show closer values of activity than farther voxels.

  3. The voxel effect changes from first to second sessions.

  4. Nearer voxels have closer values of change from 1st to 2nd sessions than farther voxels.

The estimates of the model are shown in Table 4. The variance component $\sigma_\beta^2$ stands for the between-sessions variability. It is estimated as follows [19]:

  • display math

where n stands for the number of voxels.

Table 4. Mean and SD of posterior distributions of model parameters.

        $\mu$     $\sigma_\alpha^2$    $\sigma_\beta^2$    $\sigma_\gamma^2$    $\sigma_\varepsilon^2$
Mean    0.742     16.59                0.2755              17.19                0.0022
SD      0.0043    0.6377               0.0032              0.6770               0.0017

The WinBUGS code to estimate the model, the variance components, and the intraclass correlation coefficient is available at http://cvfitxers.ub.edu/docs/102061/modelCARINTCAR.bug.

Intraclass correlation coefficient and total deviation index

One important aspect to take into account before deriving the expressions of the ICC and TDI is that, even though the voxels show spatial dependence, the data conditional on the voxel can be considered independent.

Thus, the expression of the ICC to assess the reliability of the fMRI data between sessions is as follows:

$$\mathrm{ICC}_i = \frac{\sigma_\alpha^2 a_{ii}}{\sigma_\alpha^2 a_{ii} + \sigma_\beta^2 + \sigma_\gamma^2 a_{ii} + \sigma_\varepsilon^2},$$

where $a_{ii}$ is the i-th diagonal element of A.

Thus, the main consequence of the spatial dependence among voxels is that every voxel may have a different ICC depending on its neighboring structure.

With regard to the TDI, let us explore the expression of the difference of two data from the same voxel under the former model with CAR voxel and voxel–session interaction effects:

$$D_i = Y_{i1} - Y_{i2} = (\beta_1 - \beta_2) + (\gamma_{i1} - \gamma_{i2}) + (\varepsilon_{i1} - \varepsilon_{i2})$$

Thus, the difference of two data from the same voxel depends on the fixed session bias, the random error, and the spatially structured voxel–session interaction. This result leads to a TDI that varies spatially among voxels, as in the case of the ICC.

Hence, the 90% TDI upper bound of each voxel was estimated by applying the tolerance interval approach introduced in [21].
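How both indices can vary with the neighboring structure of a voxel may be illustrated numerically. The sketch below rests on several assumptions that are not confirmed by the source: the voxel-level variances are taken to be the CAR variances multiplied by the diagonal element a_ii of A, the interaction effects of the two sessions are treated as independent, the mean difference in Table 3 is used as a plug-in for the session bias, the variance components are the posterior means as read here from Table 4, and the two a_ii values are purely hypothetical. It produces point values, not the upper bounds reported in the figures.

```python
from statistics import NormalDist

# Posterior means as read from Table 4; mean session difference from Table 3.
VAR_ALPHA, VAR_BETA, VAR_GAMMA, VAR_EPS = 16.59, 0.2755, 17.19, 0.0022
MEAN_DIFF = -0.72  # assumption: plug-in for beta_1 - beta_2


def voxel_icc(a_ii):
    """Assumed ICC of a voxel whose diagonal element of A is a_ii."""
    between = VAR_ALPHA * a_ii
    total = between + VAR_BETA + VAR_GAMMA * a_ii + VAR_EPS
    return between / total


def voxel_tdi(a_ii, p=0.90, tol=1e-10):
    """Assumed TDI: smallest kappa with P(|D| < kappa) >= p, D Normal."""
    # Assumption: Var(D) = 2 * (var_gamma * a_ii + var_eps).
    sigma_d = (2.0 * (VAR_GAMMA * a_ii + VAR_EPS)) ** 0.5
    dist = NormalDist(MEAN_DIFF, sigma_d)
    lo, hi = 0.0, abs(MEAN_DIFF) + 10.0 * sigma_d
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dist.cdf(mid) - dist.cdf(-mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical a_ii values: smaller for a well-connected interior voxel,
# larger for an edge voxel with fewer neighbours.
for label, a_ii in (("interior", 0.05), ("edge", 0.10)):
    print(label, round(voxel_icc(a_ii), 3), round(voxel_tdi(a_ii), 2))
```

Under these assumptions a larger a_ii raises the ICC and the TDI at the same time, so a voxel can look better by one index and worse by the other.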

The estimates of the ICC and TDI upper bound for plane 5 of the 3D image are shown in Figures 5 and 6, respectively.

Figure 5. Intraclass correlation coefficient estimates for plane 5.

Figure 6. Total deviation index estimates for plane 5.

Using the ICC estimates, one could argue that the agreement is higher at the edge of the image, in contradiction to the TDI estimates, which show better agreement in the center of the image. The reason for this apparent contradiction is addressed in the Discussion section.

Discussion

The assessment of the agreement or concordance among different readings can be carried out using different approaches [3, 4]. Apart from the nature of the data (qualitative or quantitative), the choice of one particular approach mainly relies on how easily the results can be interpreted and understood. Thus, it is expected that the conclusions concerning the assessment of agreement should be the same regardless of the approach applied. However, in this work we have seen that the conclusion on the concordance or reliability differed depending on whether the ICC or the TDI was applied.

Usually the degree of agreement is assessed by means of an index that summarizes the amount of concordance shown by the data. All of these indices share the characteristic of including the within-subject variability (WSV). The WSV is the essential component of agreement, because a WSV of 0 implies that all the readings from the same subject are equal. Thus, the indices can be seen as different ways of expressing the WSV.

Furthermore, most of the indices account only for the within-subject differences, for example, the TDI, the within-subject coefficient of variation, and the coefficient of individual agreement. We will call this class of indices pure agreement indices (PAI).

On the other hand, the ICC comprises a family of coefficients, because its exact expression depends on the variance components included in the associated linear mixed model. Nevertheless, given that the total variance is the sum of the between-subjects variance (BSV) and the WSV, the general expression of the ICC is the ratio of the BSV to the total variance:

$$\mathrm{ICC} = \frac{\mathrm{BSV}}{\mathrm{BSV} + \mathrm{WSV}}$$

Therefore, the WSV is related to the total variance by introducing the BSV. The ICC works as an indicator of agreement because, in the case of perfect agreement, that is, WSV = 0, the ICC takes the value of 1. What, then, is the benefit of linking the WSV to the BSV? At first glance one would say that there is none: the result is more difficult to interpret than that of a PAI and, more disturbingly, as we saw in the Results section, we could arrive at contradictory conclusions.

However, the reason for the disagreement between the PAI and the ICC is that they measure different things. It is clear that a PAI assesses agreement. So what is the ICC evaluating? Figure 7, in which the simulated probability density functions of four subjects are drawn, can shed some light on this question. Three settings of between-subjects and within-subjects variance are considered. The first setting consists of Normal distributions with means 5, 7, 9, and 11, which express the between-subjects variability, and a common variance of 3, which represents the within-subject variability. In the second setting the subject means remain the same, but the within-subject variance is lowered to 1. Finally, in the third setting, the between-subjects variance is decreased by bringing the subject means closer together, with values of 5, 6, 7, and 8.

The figure shows a higher ICC and a lower TDI in Figure 7(b) than in Figure 7(a), so one could conclude that agreement is higher in (b). However, when comparing Figure 7(b) with Figure 7(c), the ICC is higher in scenario (b), whereas the TDI remains the same. Comparing Figure 7(a) with Figure 7(c), the ICC is similar, whereas the TDI is lower in scenario (c). Considering that the TDI is a pure agreement index, we can argue that the ICC is evaluating a different concept. The ICC includes the WSV, so agreement is an essential component of the ICC; nevertheless, the ICC evaluates a concept that goes beyond agreement. This concept is distinguishability, and it is the consequence of setting the WSV against the BSV. Distinguishability can therefore be defined as the ability to identify and discriminate subjects by their own values.

Figure 7. Probability density plots of four subjects under different conditions of between-subjects variance (BSV) and within-subjects variance (WSV). The y-axis represents the probability density, whereas the x-axis shows the simulated values. The intraclass correlation coefficient and the total deviation index are reported above the plots.

Figure 7(a) shows a scenario with high BSV and WSV. This leads to a situation in which it is difficult to distinguish subjects by means of their observed values because their probability density functions overlap to a high degree. When moving to Figure 7(b), the spread of subjects (BSV) is the same but the WSV is lower. Thus, the agreement is higher and it is easier to distinguish subjects. In scenario (c) the degree of agreement is the same as in scenario (b), but the lower BSV leads to a worse distinguishability, similar to that of scenario (a).
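The qualitative pattern across the three scenarios can be retraced with a few lines of arithmetic. The sketch below makes assumptions that the source does not state: the BSV is taken as the sample variance of the four subject means, the within-subject variance of scenario (c) is taken as 1, the TDI is computed for p = 0.90, and the difference between two readings of a subject is assumed Normal with mean 0 and variance twice the WSV. The values printed are therefore illustrative and need not match those reported in Figure 7.

```python
from statistics import NormalDist, variance

# Subject means and within-subject variance (WSV) of the three settings.
scenarios = {
    "a": {"means": [5.0, 7.0, 9.0, 11.0], "wsv": 3.0},
    "b": {"means": [5.0, 7.0, 9.0, 11.0], "wsv": 1.0},
    "c": {"means": [5.0, 6.0, 7.0, 8.0], "wsv": 1.0},
}


def icc(means, wsv):
    """BSV / (BSV + WSV); assumption: BSV is the sample variance of the means."""
    bsv = variance(means)
    return bsv / (bsv + wsv)


def tdi(wsv, p=0.90):
    """TDI assuming D is Normal with mean 0 and variance 2 * WSV."""
    return NormalDist().inv_cdf((1.0 + p) / 2.0) * (2.0 * wsv) ** 0.5


results = {name: (icc(s["means"], s["wsv"]), tdi(s["wsv"]))
           for name, s in scenarios.items()}
for name, (r, kappa) in results.items():
    print(f"({name}) ICC = {r:.2f}, TDI = {kappa:.2f}")
```

Scenarios (b) and (c) share the same TDI because they share the same WSV, while their ICCs differ only through the spread of the subject means; scenarios (a) and (c) have similar ICCs despite clearly different TDIs.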

It is relevant to note that the concept of distinguishability has already been applied to qualitative scales [28, 29], in the sense of distinguishability of categories. Thus, distinguishability concerns the agreement of the readings from the same subject as well as the spread of subjects. An extreme situation would be that in which the probability density function was the same for all subjects, that is, a situation of total overlap. The BSV would be 0, leading to an ICC of 0 regardless of the WSV, that is, of the degree of agreement.

The ICC is still a valid index to compare the degree of agreement among different datasets if their BSVs are similar and there is a reasonable spread of subjects with regard to the outcome analyzed, that is, if the BSV expresses the variability of the outcome in the population. However, when these conditions are not fulfilled, the validity of the ICC as an indicator of agreement is questionable and a pure agreement index would be preferable.

Concerning the examples analyzed in Section 4, in the case of the intra-clutch repeatability the Columbretes location showed lower values of both BSV and WSV. So, the agreement is higher than at the Ebro Delta location, but it comes with a lower spread of subjects; therefore, the distinguishability is lower at the Columbretes location. However, what was of interest in this study? The researchers were not concerned with whether the agreement of δ15N was higher or lower at one particular location. They were interested in appraising whether δ15N was a good marker of individual nutrient allocation strategy. A high ICC would mean the following: (i) the subjects behave differently from one another; and (ii) the behavior of each subject is consistent across repeated measures. In summary, the subjects are distinguishable by their δ15N. Thus, in this study, the ICC is the appropriate index, whereas a pure agreement index would not be suitable.

With regard to the fMRI example, the discrepancy between the ICC and the TDI is explained by the variation of the BSV across subjects (voxels) caused by the spatial correlation. In this case example, the ICC informs on the level of voxel distinguishability in the region. Therefore, the observed TDI and ICC patterns can be explained by the plausible biological situation in which the central voxels in the SMA are highly activated from session to session with roughly the same magnitude (low TDI), yet are indistinguishable from each other (low ICC). In this case, the researchers were interested in evaluating the degree of agreement between sessions. Thus, taken as an agreement index, the ICC may be misleading here and the TDI should be chosen.

Hence, when faced with a discrepancy between a PAI, such as the TDI, and the ICC, the interpretation of the agreement should be based on the PAI. Additionally, it should be concluded that the ICC assesses the distinguishability among subjects, which involves at the same time (1) the spread of subjects in relation to the analyzed outcome and (2) the agreement among repeated measurements.

References
