Agreement studies in obstetrics and gynaecology: inappropriateness, controversies and consequences
Dr C. Costa Santos, Faculty of Medicine, University of Porto, Biostatistics and Medical Informatics, Al. Prof. Hernâni Monteiro, Porto, 4200-319 Portugal.
A literature review was performed to assess how agreement has been measured in obstetrics and gynaecology. Inappropriate or controversial measures were often used. The possible consequences of these inadequacies for validity studies and randomised controlled trials are illustrated with case examples. Until new and better ways of assessing agreement are found, the combined use of two measures is encouraged in agreement studies: proportions of agreement and kappa for categorical variables, and limits of agreement and the intraclass correlation coefficient for continuous variables.
The importance of agreement studies in medicine has been stressed in leading medical journals.1,2 Reproducibility studies assess the agreement of measurements obtained by two or more observers, or by one observer on two or more occasions under the same conditions; these are termed inter-observer and intra-observer agreement, respectively. Validity studies assess the agreement of measurements with a reference measure (the gold standard), and should only be performed after a reproducibility study. Randomised controlled trials should only be performed after adequate reproducibility and validity studies.1–4 The important question is: how can researchers, doctors and lawyers interpret similar studies with results that disagree, and make decisions based on these results?
However, the importance of agreement studies is not always given due recognition. Moreover, in many cases, inappropriate or controversial measures of agreement are used, particularly regarding reproducibility and validity studies with the same scale for the test under scrutiny and the gold standard.1–4
In order to help clear up this problem, we decided to assess how agreement has been measured in reproducibility and validity studies with the same scale for the test under scrutiny and the gold standard in three world-leading journals in obstetrics and gynaecology. We also illustrate how poor reproducibility can dramatically influence the results of validity studies, randomised controlled trials and meta-analyses.
All abstracts from the American Journal of Obstetrics and Gynecology, the British Journal of Obstetrics and Gynaecology and Obstetrics and Gynaecology published between the beginning of 1997 and the end of 2001 were reviewed, excluding letters and editorials. The review was performed by a single statistician experienced in agreement studies. Another statistician performed the same review for one randomly selected year for each journal and obtained the same results.
All reproducibility and validity studies with the same scale for the test under scrutiny and the gold standard were selected, excluding validity studies considering dichotomous variables (sensitivity and specificity studies). The full text of all selected articles was analysed and discussed by two statisticians to determine which measures were used: kappa (κ), proportions of agreement (PA), the intraclass correlation coefficient (ICC), limits of agreement (LA) or any other measure.
κ is the observed agreement (i.e. PA) minus the agreement expected by chance, divided by perfect agreement minus the agreement expected by chance. Thus, this measure assesses agreement beyond chance in nominal scales. In ordinal scales, where the discrepancy between some categories is much worse than between others, weighted κ corrects the observed agreement not only for chance but also gives credit for partial agreement.3 ICC is a measure of agreement for continuous variables that takes into account both the variation between observers and the variation between subjects: ICC is the proportion of the total variance that is due to variation between subjects.3 To measure agreement in continuous variables, Bland and Altman recommended LA, which involves a scatterplot of the difference between the measurements obtained by the two observers against the mean of the two measurements for each subject; LA is expressed as the range that encompasses 95% of these differences.1,2
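The four measures above can be sketched in a few lines of Python. This is a minimal illustration of the definitions just given, with hypothetical ratings, not data from the reviewed studies; the one-way random-effects form of the ICC is assumed.

```python
from statistics import mean, stdev

def proportion_agreement(a, b):
    """PA: fraction of paired ratings that are identical."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Kappa: observed agreement corrected for chance, (Po - Pe) / (1 - Pe)."""
    n = len(a)
    po = proportion_agreement(a, b)
    pe = sum((a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b))
    return (po - pe) / (1 - pe)

def limits_of_agreement(a, b):
    """Bland-Altman 95% limits: mean difference +/- 1.96 SD of differences."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) - 1.96 * stdev(d), mean(d) + 1.96 * stdev(d)

def icc_oneway(a, b):
    """One-way random-effects ICC for two observers: the proportion of
    total variance attributable to between-subject variation."""
    n, k = len(a), 2
    grand = mean(a + b)
    subj = [(x + y) / 2 for x, y in zip(a, b)]          # per-subject means
    msb = k * sum((m - grand) ** 2 for m in subj) / (n - 1)
    msw = sum((x - m) ** 2 + (y - m) ** 2
              for x, y, m in zip(a, b, subj)) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Categorical ratings from two observers (hypothetical):
r1 = ["+", "+", "-", "-", "-", "-", "+", "-"]
r2 = ["+", "-", "-", "-", "-", "+", "+", "-"]
pa = proportion_agreement(r1, r2)   # 0.75
k = cohen_kappa(r1, r2)             # ~0.47: lower, after removing chance

# Continuous measurements from two observers (hypothetical):
m1 = [10.1, 12.3, 9.8, 14.2, 11.0]
m2 = [10.4, 12.0, 10.1, 13.9, 11.3]
lo, hi = limits_of_agreement(m1, m2)
icc = icc_oneway(m1, m2)
```

The contrast between `pa` and `k` on the same ratings shows concretely how κ discounts the agreement expected by chance.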
To demonstrate how poor reproducibility can influence the results of validity studies, randomised controlled trials and meta-analyses, we studied two hypothetical case examples.
The literature search in the three leading journals over the five-year period considered revealed 23 reproducibility or validity studies with the same scale for the test under scrutiny and the gold standard, 11 using categorical variables and 15 using continuous variables; some studies involved both kinds of variables. The list of agreement studies analysed is available from the authors upon request. With categorical variables, agreement was measured with κ only or weighted κ only in 73% of cases, never with PA alone, and with both measures in 18% of cases. With continuous variables, agreement was measured with ICC only in 33% of cases, with LA only in 13% and with both measures in 27%. In 9% of cases with categorical variables and 27% with continuous variables, clearly inadequate measures were used, such as the Pearson correlation coefficient, the Spearman correlation coefficient and χ2.
In Table 1, an example illustrates how poor reproducibility can influence the result of a validity study. A commonly observed degree of observer variation in a diagnostic test led to variations in test sensitivity as wide as 0% to 100%. How can we interpret the differing results of these three similar trials, performed on the same eight cases, when assessing the validity of this diagnostic test? Is the sensitivity of this diagnostic test 0%, as suggested by trial 2, or 100%, as suggested by trial 3?
Table 1. Assessment of the same diagnostic test (positive or negative result) in three trials on the same eight cases. There is poor reproducibility among the three trials. With the final outcome in mind, in trial 1 the sensitivity is 50% (1/2) and the false positive rate is 17% (1/6); in trials 2 and 3 the sensitivities are 0% (0/2) and 100% (2/2), respectively, and the false positive rates are 33% (2/6) and 0% (0/6).
If, by contrast, there were perfect reproducibility in the three trials (for example, trials 2 and 3 identical to trial 1), the sensitivity and the false positive rate would always be the same in all trials: 50% and 17%, respectively.
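The figures in the Table 1 caption can be reproduced directly. The per-trial counts below (true positives, false negatives, false positives, true negatives) are read off the caption, given that two of the eight cases have the outcome and six do not:

```python
# Sensitivity and false positive rate for the three hypothetical trials
# of Table 1, computed from the counts implied by the table caption.
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    """FP / (FP + TN)."""
    return fp / (fp + tn)

trials = {           # trial: (tp, fn, fp, tn)
    1: (1, 1, 1, 5),
    2: (0, 2, 2, 4),
    3: (2, 0, 0, 6),
}
for t, (tp, fn, fp, tn) in trials.items():
    print(f"trial {t}: sensitivity {sensitivity(tp, fn):.0%}, "
          f"false positive rate {false_positive_rate(fp, tn):.0%}")
# trial 1: 50% / 17%; trial 2: 0% / 33%; trial 3: 100% / 0%
```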
In Table 2, an example illustrates how poor reproducibility of a diagnostic test A can lead to relative risks as wide as 0.1 to 2 in three trials comparing test A plus a 100% effective intervention with test B plus a 100% effective intervention. How can we interpret the results of the three similar trials? Is test A with a 100% effective intervention worse than test B with a 100% effective intervention, as suggested by trial 1 with a relative risk of 2? Or is it better, as suggested by trial 3 with a relative risk of 0.1?
Table 2. Assessment of outcome prevalence after diagnostic test A plus a 100% effective intervention and after diagnostic test B plus a 100% effective intervention in three trials. There is no observer agreement in test A among the three trials, whereas there is perfect agreement in test B. The relative risks between test A plus effective intervention and test B plus effective intervention range from 0.1 to 2 across the three trials.
If, by contrast, test A had perfect reproducibility, as test B did, the relative risk would always be the same in the three trials.
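The relative-risk calculation behind Table 2 is simple to sketch. Table 2's underlying event counts are not given above, so the counts below are hypothetical, chosen only to reproduce the two extreme relative risks (2 and 0.1) mentioned for trials 1 and 3:

```python
# Relative risk: outcome risk after test A + intervention divided by
# outcome risk after test B + intervention. Counts are hypothetical.
def relative_risk(events_a, n_a, events_b, n_b):
    """Risk in group A divided by risk in group B."""
    return (events_a / n_a) / (events_b / n_b)

print(relative_risk(2, 100, 1, 100))   # trial 1: RR = 2.0
print(relative_risk(1, 100, 10, 100))  # trial 3: RR ~ 0.1
```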
Our results show that κ and ICC presented without any other measure are the most common ways of assessing agreement in categorical and continuous variables, respectively. In fact, these measures are advocated by some authors as the best measures of agreement. However, they have been strongly contested by other authors.1,2 κ and ICC measure association rather than agreement1,2 and are affected by the prevalence of the categories in the assessment and by observer bias, which may result in potentially misleading interpretations.5 On the other hand, LA and PA measure agreement, are not affected by differences in the prevalence of the categories in the assessment and may reveal existing observer bias.1,2 However, they do not take the variation between subjects into account, and their interpretation depends on a subjective comparison with the range of the measurements encountered in clinical practice.
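The prevalence effect on κ can be demonstrated in a short sketch. The two hypothetical 2 × 2 rating tables below have identical observed agreement (PA = 0.90), yet κ drops sharply once one category dominates, which is exactly why κ alone can mislead:

```python
# Two 2x2 tables with the same PA but different category prevalence.
def kappa_2x2(a, b, c, d):
    """Cohen's kappa for a 2x2 table [[a, b], [c, d]] of two raters."""
    n = a + b + c + d
    po = (a + d) / n                                      # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # chance agreement
    return (po - pe) / (1 - pe)

print(kappa_2x2(45, 5, 5, 45))  # balanced categories: kappa ~ 0.80
print(kappa_2x2(85, 5, 5, 5))   # skewed categories:   kappa ~ 0.44
```

Both tables show 90 agreements in 100 cases; only the marginal prevalence differs.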
Our results also show that the Spearman correlation coefficient, the χ2 test and the Pearson correlation coefficient are sometimes used as measures of agreement. It has been demonstrated repeatedly that they measure association, not agreement.1–4 Complete disagreement can coexist with perfect association, i.e. a correlation coefficient of 1. Our examples also show how the observer disagreements commonly observed in reproducibility studies may lead to discrepancies in validity studies and randomised controlled trials. Under-estimation or ignorance of this situation or, even worse, wrong assumptions about it may have obvious implications for research and clinical practice, as well as medico-legal consequences. Indeed, how can we interpret the results of most studies of the sensitivity and specificity of some of the most common screening and diagnostic tests in obstetrics and gynaecology? What, for example, are the sensitivity and specificity of colposcopy and of cervical cytology in the detection of CIN II and III? Mitchell et al.6 found sensitivities and specificities of colposcopy ranging from 87% to 100% and from 26% to 87%, respectively, in nine published studies using similar methods. The same authors found sensitivities and specificities of cervical cytology ranging from 0% to 93% and from 0% to 100%, respectively, in 26 published studies using similar methods.6 And what, for example, are the relative risks of the poor outcomes generally supposed to be avoided by interventions based on cardiotocography and mammography? Are they really close to 1, as suggested by most meta-analyses,7,8 or are they just the mean of discrepant results from the different randomised controlled trials included in those meta-analyses?
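The point that association is not agreement can be made concrete: if one observer's readings are exactly double the other's (hypothetical readings below), the Pearson correlation is a perfect 1 even though the two observers never agree on a single value:

```python
# Perfect association with complete disagreement between two observers.
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient of paired readings."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

obs1 = [10, 20, 30, 40, 50]
obs2 = [v * 2 for v in obs1]                    # systematic disagreement
print(pearson_r(obs1, obs2))                    # 1.0: perfect association
print(sum(a == b for a, b in zip(obs1, obs2)))  # 0 exact agreements
```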
Agreement studies, involving both reproducibility and validity, should be performed and interpreted in an adequate manner. Until better ways of assessing agreement are developed, in reproducibility and validity studies with the same scale for the test under scrutiny and the gold standard, we recommend the use of both κ and PA (for each category and overall) for categorical variables, and of both ICC and LA for continuous variables.
Accepted 2 September 2004