Early Detection and Diagnosis
Inter-test agreement and quantitative cross-validation of immunochromatographical fecal occult blood tests
Version of Record online: 4 JAN 2010
Copyright © 2010 UICC
International Journal of Cancer
Volume 127, Issue 7, pages 1643–1649, 1 October 2010
How to Cite
Brenner, H., Haug, U. and Hundt, S. (2010), Inter-test agreement and quantitative cross-validation of immunochromatographical fecal occult blood tests. Int. J. Cancer, 127: 1643–1649. doi: 10.1002/ijc.25154
- Issue online: 4 AUG 2010
- Manuscript Accepted: 21 DEC 2009
- Manuscript Received: 26 OCT 2009
- German Research Foundation (Deutsche Forschungsgemeinschaft). Grant Number: Graduiertenkolleg 793
- colorectal cancer;
- early detection;
- fecal occult blood tests;
Immunochromatographical fecal occult blood tests have been shown to have higher sensitivity for detecting colorectal neoplasms than the commonly used guaiac-based test. However, positivity rates, sensitivity and specificity vary widely. We aimed to assess the reasons for this heterogeneity. Six dichotomous (qualitative) immunochromatographical tests were applied to the same stool samples, taken before cathartic bowel preparation, from 1,330 participants of the German colonoscopy screening program. Positivity rates were determined, and inter-test agreement beyond chance was quantified by kappa coefficients (κ). In addition, kappa coefficients were expressed in relation to their maximum possible values given differences in test positivity rates (κ/κmax). Furthermore, the distribution of fecal hemoglobin concentration was assessed by an additional quantitative test in participants classified as clearly positive, borderline positive or clearly negative according to the qualitative tests. Positivity rates varied strongly, from 6.4 to 46.8%. As a result, overall agreement between tests was only poor to moderate, with κ ranging from 0.14 to 0.61. However, apart from the different positivity rates, agreement was mostly very high, with κ/κmax ranging from 0.53 to 1.00, and exceeding 0.70 in 12 of 15 cases. The distribution of fecal hemoglobin concentrations in the various categories varied strongly across tests. The observed patterns suggest that the strongly different positivity rates essentially reflect different cutoff levels of tests with otherwise very high inter-test agreement. Definition of cutoffs is a critical issue in the application of immunochromatographical tests, and cutoffs should be redefined for several of these tests.
Fecal occult blood tests (FOBTs) are widely recommended and used for early detection of colorectal cancer (CRC) and its precursors.1 In particular, the guaiac-based FOBT, whose application has been shown to reduce incidence and mortality of CRC under trial conditions,2 has been used for decades. In recent years, there has been increasing interest in immunochemical FOBTs, which overcome several analytical problems and might show better diagnostic performance compared with the guaiac-based FOBT.3–9 Both quantitative and qualitative (dichotomous) tests have been developed and are promoted for population-based screening. The latter are chromatographical tests requiring a visual interpretation of test results as positive or negative. A major advantage of immunochromatographical tests is their easy on-site implementation without the need for specific laboratory equipment. However, recent studies have disclosed major differences in the analytical performance of various tests.7, 10 In particular, we have shown in a study among 1,319 participants of screening colonoscopy that positivity rates, sensitivity and specificity of 6 qualitative immunochromatographical FOBTs, applied to stool samples from a single bowel movement, varied widely across tests.7 We aimed to explore the reasons for this heterogeneity by (i) assessing inter-test agreement between these tests, overall and after accounting for the difference in positivity rates, and (ii) comparing results to an additional quantitative measurement of hemoglobin in the stool samples.
Material and Methods
Study design and study population
Details of the study design and the study population have been described elsewhere.7 Briefly, 1,785 participants of the German colonoscopy screening program were recruited in 20 gastroenterology practices between January 2006 and December 2007 according to a protocol approved by the ethics committees of the Medical Faculty Heidelberg of the University of Heidelberg and of the physicians' chambers of Baden-Württemberg, Rheinland Pfalz and Hessen. Patients were informed about the study at a preparatory visit in the practice, typically about 1 week before colonoscopy. After giving informed consent, patients were given stool collection instructions and devices, including a small container (60 mL) for collecting a stool sample. Participants were asked to provide a stool sample before bowel preparation for colonoscopy. Stool from 1 bowel movement was to be collected without any specific recommendations for dietary or medicinal restrictions. In the written sampling instruction, we asked participants to keep the stool-filled container in a provided plastic bag in the freezer or, if not possible, in the refrigerator. On the day of colonoscopy, the stool-filled container was handed in at the gastroenterological practice and transferred to a central laboratory in a cold chain (−20°C). Furthermore, patients were asked to fill out a standardized questionnaire. Colonoscopy and histology reports were collected, and relevant data were extracted in a standardized manner. The latter was done independently by 2 trained investigators who were blinded with respect to test results, and potential discrepancies were resolved by consensus.
The stool-filled containers were thawed at a median interval of 4 days after arrival at the central laboratory. The 6 different qualitative immunochemical FOBTs were done and evaluated in the same session by a single, trained investigator who was blinded with respect to colonoscopy findings. The following tests were applied: immoCARE-C, CAREdiagnostica, Voerde, Germany; FOB advanced, ulti med, Ahrensburg, Germany; PreventID CC, Preventis, Bensheim, Germany; Bionexia FOBplus, DIMA, Göttingen, Germany; QuickVue iFOB, Quidel, San Diego, California; and Bionexia Hb/Hp Complex, DIMA, Göttingen, Germany. Each test was rated as clearly positive, borderline positive or clearly negative based on visual comparison with sample images of positive and negative test results provided by the manufacturers. In our main analysis, both borderline positive and clearly positive tests were interpreted as positive according to the manufacturers' instructions.
For comparison, stool samples were additionally tested by a quantitative immunochemical FOBT. Starting with 1 g of stool, fecal hemoglobin was measured using an automated ELISA according to the manufacturer's instructions (RIDASCREEN® Haemoglobin, R-Biopharm AG, Bensheim, Germany).11 The lower detection limit and the cutpoint for positivity given by the manufacturer are 0.42 and 2 μg/g stool, respectively.
For statistical analysis, the following exclusions were made to ensure conditions of a screening setting and to minimize potential misclassification due to imperfect colonoscopy: visible rectal bleeding or previous positive FOBT result (n = 111); inflammatory bowel disease (n = 13); previous colonoscopy in the past 5 years (n = 117); stool sampling after colonoscopy (n = 65); inadequate bowel preparation for colonoscopy (n = 79); and incomplete colonoscopy (n = 22). In addition, we excluded 48 patients with pseudopolyps or histologically undefined polyps at screening colonoscopy. After application of these exclusion criteria, 1,330 participants of screening colonoscopy were retained for the analysis.
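The exclusion counts above can be checked against the reported sample size with a few lines of arithmetic. The following is only a bookkeeping sketch: the dictionary keys are shorthand labels introduced here, while the numbers are those stated in the text.

```python
# Participants recruited, as stated in the study design section.
recruited = 1785

# Exclusion counts as reported in the text; the key names are
# shorthand labels chosen for this sketch, not terms from the study.
exclusions = {
    "rectal_bleeding_or_previous_positive_fobt": 111,
    "inflammatory_bowel_disease": 13,
    "colonoscopy_in_past_5_years": 117,
    "stool_sampling_after_colonoscopy": 65,
    "inadequate_bowel_preparation": 79,
    "incomplete_colonoscopy": 22,
    "pseudopolyps_or_undefined_polyps": 48,
}

total_excluded = sum(exclusions.values())
retained = recruited - total_excluded

print(f"excluded: {total_excluded}, retained: {retained}")
```

The result agrees with the 1,330 participants retained for the analysis, provided that the exclusion categories do not overlap (an assumption of this check, not a statement from the text).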
Patients were described with respect to major sociodemographic characteristics, findings at colonoscopy and diagnostic performance of single tests with respect to advanced neoplasms (defined as the presence of CRC or of at least 1 adenoma with at least 1 of the following features: >1 cm in size, tubulovillous or villous components and high-grade dysplasia).
Inter-test agreement was quantified by kappa coefficients in a pairwise manner (i.e., for each combination of 2 different tests). Kappa coefficients quantify agreement beyond chance. First, kappa coefficients for overall agreement, denoted as κ, were calculated. To separate inter-test variation due to differences in positivity rates of the different tests from inter-test variation from other reasons, we then calculated maximum possible values of kappa coefficients given the positivity rates of the different tests (κmax), and we calculated the ratio κ/κmax as an indicator of agreement after accounting for differences in positivity rates.
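For orientation, the quantities described above can be written out for a pair of dichotomous tests A and B. The notation below ($p_o$, $p_e$, $p_A$, $p_B$) is introduced here for illustration, and the expressions are the standard textbook definitions for a 2 × 2 table; they are assumed, not confirmed, to correspond to the calculation used in this analysis.

```latex
% p_o : observed proportion of participants with concordant results
% p_A, p_B : positivity rates of tests A and B
\begin{align}
  p_e &= p_A\,p_B + (1 - p_A)(1 - p_B)
      && \text{(agreement expected by chance)} \\
  \kappa &= \frac{p_o - p_e}{1 - p_e} \\
  p_{o,\max} &= \min(p_A, p_B) + \min(1 - p_A,\, 1 - p_B)
      && \text{(largest agreement the margins allow)} \\
  \kappa_{\max} &= \frac{p_{o,\max} - p_e}{1 - p_e},
  \qquad
  \frac{\kappa}{\kappa_{\max}} = \frac{p_o - p_e}{p_{o,\max} - p_e}
\end{align}
```

Under these definitions, κ/κmax equals 1 whenever every sample rated positive by the test with the lower positivity rate is also rated positive by the other test, which is the pattern expected if two tests differ only in their cutoff.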
Next, the distribution (median, interquartile range) of hemoglobin levels according to the quantitative FOBT was assessed for each of the following categories of the test results: clearly positive, borderline positive and clearly negative.
Finally, diagnostic performance of single tests was evaluated, using colonoscopy results as gold standard. Comparative analyses were carried out, in which borderline positive results were either rated as positive (as recommended by the manufacturers) or as negative. Sensitivity was calculated with respect to detection of advanced colorectal neoplasms, a commonly used combined endpoint of CRC and advanced adenomas. In additional analyses, sensitivity was also calculated with respect to detection of large advanced neoplasms only, defined as CRC or adenoma ≥1 cm in size. Specificity was calculated as the proportion of negative tests among study participants in whom no or only hyperplastic polyps were detected at colonoscopy.
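The comparative analysis described above, in which borderline positive results are rated either as positive or as negative, might be sketched as follows. This is a minimal illustration and not the authors' code: the function names and the toy rating lists are hypothetical, and the counts are invented purely to show the mechanics.

```python
def is_positive(rating, borderline_as_positive=True):
    """Dichotomize a three-level visual rating."""
    if rating == "clearly positive":
        return True
    if rating == "borderline positive":
        return borderline_as_positive
    return False


def positive_fraction(ratings, borderline_as_positive=True):
    """Share of ratings counted as positive under the chosen rule."""
    positives = sum(is_positive(r, borderline_as_positive) for r in ratings)
    return positives / len(ratings)


def sensitivity(ratings_with_lesion, borderline_as_positive=True):
    """Positive fraction among participants with the target lesion."""
    return positive_fraction(ratings_with_lesion, borderline_as_positive)


def specificity(ratings_lesion_free, borderline_as_positive=True):
    """Negative fraction among participants free of relevant findings."""
    return 1.0 - positive_fraction(ratings_lesion_free, borderline_as_positive)


# Hypothetical toy data (NOT study data): 10 participants with an advanced
# neoplasm and 20 participants with no or only hyperplastic polyps.
with_lesion = (["clearly positive"] * 3 + ["borderline positive"] * 2
               + ["clearly negative"] * 5)
lesion_free = (["clearly positive"] * 1 + ["borderline positive"] * 2
               + ["clearly negative"] * 17)

for rule in (True, False):
    print(f"borderline as positive = {rule}: "
          f"sensitivity = {sensitivity(with_lesion, rule):.2f}, "
          f"specificity = {specificity(lesion_free, rule):.2f}")
```

Reclassifying borderline results as negative can only lower or preserve sensitivity and raise or preserve specificity, which is the trade-off examined in the comparative analyses.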
Tables 1 and 2 provide a summary of key characteristics of the study population, of findings at colonoscopy and of diagnostic performance of the single tests. Women and men were about equally represented. The majority of participants were between 55 and 64 years old; median age was 63. The most advanced finding at colonoscopy was CRC in 11 cases (0.8%), an advanced adenoma in 130 cases (9.8%) and another (nonadvanced) adenoma in 275 cases (20.7%). The positivity rate of the 6 tests varied widely, from 6.4% for immoCARE-C to 46.8% for Bionexia Hb/Hp Complex. Sensitivity and specificity with respect to detection of advanced neoplasms likewise varied widely, and they showed a clear positive and negative relation with the test positivity rate, respectively.
Table 3 illustrates the types of measures of inter-test agreement used in this analysis for the 2 tests with the lowest positivity rates (immoCARE-C and FOB advanced; see Table 2). Both tests agreed for 1,165 + 66 = 1,231 of 1,330 participants. Given the marginal frequencies, both tests would have been expected to agree by chance in 1,118 participants. Both tests therefore agreed beyond chance expectations in 1,231 − 1,118 = 113 participants, which translates to a kappa value of κ = 113/(1,330 − 1,118) = 0.53. Given the different marginal frequencies of both tests, the maximum possible value of kappa (κmax) would have been 0.71, i.e., the observed kappa reached 75% of its maximum possible value (κ/κmax = 0.75).
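The worked example above can be retraced with a short script. This is a sketch under stated assumptions rather than the authors' code: the function `kappa_stats` is a hypothetical helper using the standard 2 × 2 definitions of κ and κmax, and the split of the 99 discordant results into 19 and 80 is not given in the text. It is inferred here from the 6.4% positivity rate of immoCARE-C (about 85 of 1,330 participants) and should be read as an assumption.

```python
def kappa_stats(both_pos, only_a_pos, only_b_pos, both_neg):
    """Return (kappa, kappa_max, kappa / kappa_max) for a 2 x 2 table."""
    n = both_pos + only_a_pos + only_b_pos + both_neg
    pos_a = both_pos + only_a_pos      # positives by test A
    pos_b = both_pos + only_b_pos      # positives by test B
    neg_a, neg_b = n - pos_a, n - pos_b

    p_obs = (both_pos + both_neg) / n                      # observed agreement
    p_exp = (pos_a * pos_b + neg_a * neg_b) / n ** 2       # chance agreement
    p_max = (min(pos_a, pos_b) + min(neg_a, neg_b)) / n    # best the margins allow

    kappa = (p_obs - p_exp) / (1 - p_exp)
    kappa_max = (p_max - p_exp) / (1 - p_exp)
    return kappa, kappa_max, kappa / kappa_max


# Concordant cells (66 and 1,165) are taken from the text.
# ASSUMPTION: the discordant cells (19 and 80) are inferred, not reported.
kappa, kappa_max, ratio = kappa_stats(
    both_pos=66, only_a_pos=19, only_b_pos=80, both_neg=1165
)
print(f"kappa = {kappa:.2f}, kappa_max = {kappa_max:.2f}, ratio = {ratio:.2f}")
# prints: kappa = 0.53, kappa_max = 0.71, ratio = 0.75
```

With the assumed discordant split, the script returns the three values reported in the text, which suggests that the standard definitions were used; other splits of the 99 discordant results would change κmax.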
Table 4 shows the kappa coefficients for all pairwise comparisons of the 6 immunochemical tests. Overall agreement was generally poor or at best modest, with κ ranging from 0.14 to 0.61. In general, there was a close relationship between increasing differences in the positivity rate and decreasing overall agreement. However, apart from the different positivity rates, agreement was mostly very high, with κ/κmax ranging from 0.53 to 1.00, and exceeding 0.70 in 12 of 15 pairwise comparisons. Interestingly, κ/κmax was highest (0.93 and 1.00) for tests with the most divergent positivity rates (immoCARE-C versus QuickVue iFOB and Bionexia Hb/Hp Complex, respectively).
Table 5 shows the proportions of clearly positive, borderline positive and clearly negative test results, along with the median fecal hemoglobin concentrations within these categories according to the quantitative immunochemical test. The proportions of clearly positive results ranged from 3.6% for immoCARE-C to 28.2% for Bionexia Hb/Hp Complex, and the proportions of borderline positive results ranged from 2.8% for immoCARE-C to 18.6% for Bionexia Hb/Hp Complex. The ratio of borderline positive to clearly positive results ranged between 1:2 and 1:1 for all tests. The median hemoglobin concentration of stool samples with clearly positive and borderline positive results was inversely related to the frequency of such findings for the various tests. There was a gradient from the highest hemoglobin concentrations in clearly positive samples to the lowest concentrations in clearly negative samples for all immunochromatographical tests. However, for the 4 tests with the highest proportions of borderline positive results (PreventID CC, Bionexia FOBplus, QuickVue iFOB, Bionexia Hb/Hp Complex), the median hemoglobin levels were only slightly higher in case of borderline positive results than in the case of clearly negative results, and they were far below the threshold of positivity of the quantitative test (2 μg hemoglobin/g stool).
Table 6 shows the positivity rates, the sensitivity for detection of advanced neoplasms and advanced large neoplasms (≥1 cm) and the specificity with respect to absence of adenomas according to rating of borderline positive results. When borderline positive results were classified as positive (as suggested by the manufacturers), only 2 tests (immoCARE-C, FOB advanced) reached levels of specificity of 90% or higher, which are typically required for population-based screening (gray shaded fields). However, sensitivities for these tests were only 30 and 31% for detection of any advanced neoplasm and 39 and 38% for detection of large neoplasms, respectively. For these 2 tests, classification of borderline positive results as negative would further substantially reduce sensitivity, whereas specificity would only be slightly increased. By contrast, classification of borderline positive results as negative would increase specificity of PreventID CC and Bionexia FOBplus to levels of 90% and higher, while maintaining sensitivity of any advanced adenoma at 38 and 44% and for large adenomas at 45 and 56%, respectively. Sensitivities would even be higher for QuickVue iFOB and Bionexia Hb/Hp Complex, but specificity of these tests would remain at levels around 80% only even after reclassification of borderline positive results as negative.
In this large diagnostic study from the screening setting, positivity rates of 6 different qualitative, immunochromatographical FOBTs differed widely. Apart from the very different thresholds of positivity, however, inter-test agreement was generally very high. In particular, the tests with the lowest and highest positivity rate showed the maximum possible agreement within the limits set by the different thresholds of positivity. Comparison with the results of a quantitative immunochemical test likewise indicated that differences between tests essentially reflect divergent thresholds for detection of hemoglobin in stool. Furthermore, between one-third and one-half of positive results were borderline positive. Although these borderline results were to be classified as positive according to the manufacturers' instructions for all tests, comparison with the quantitative test results suggested that borderline results should rather be classified as negative for 4 of the tests, and further adjustment of cutoffs might be warranted for at least 2 of them.
Our finding of frequent occurrence of borderline test results and major discrepancies between different immunochromatographical FOBTs in a large sample of screening participants is in line with and extends recent observations in a much smaller sample of patients (n = 71) recruited in hospital and outpatient settings.10 For tests taken from the same bowel movement, between-test variation may reflect either within-sample heterogeneity of hemoglobin concentration or differences in the performance of tests. In the setting of our study, all tests were done under identical laboratory conditions and using stool provided in a small box containing up to 60 mL from the same bowel movement, which probably left little room for heterogeneity in sample handling or within-sample heterogeneity of hemoglobin concentration. The very high inter-test agreement within the limits imposed by the different test positivity rates seems to essentially rule out any major variation in test quality apart from the use of different thresholds. Interestingly, after accounting for differences in the positivity rate, inter-test agreement between Bionexia Hb/Hp Complex, which determines both fecal hemoglobin and hemoglobin–haptoglobin levels, and the other tests, which determine fecal hemoglobin only, was as high as inter-test agreement among the latter tests. Also, inter-test agreement for the 2 tests from the same company (Bionexia FOBplus and Bionexia Hb/Hp Complex) was in the same order of magnitude as inter-test agreement between tests from different companies.
To our knowledge, no previous study has assessed and compared the results of a panel of qualitative, immunochromatographical FOBTs and a quantitative immunochemical FOBT. Although agreement between qualitative and quantitative tests was generally rather good, with a very clear gradient from the highest fecal hemoglobin concentrations measured in patients with clearly positive results to the lowest concentrations measured in those with clearly negative results, our results again suggest that the thresholds of positivity diverge strongly between the qualitative tests. For example, the median hemoglobin concentration in patients with test results classified as borderline positive by immoCARE-C, the test with the lowest positivity rate, was substantially higher (8.0 μg/g stool) than the median hemoglobin concentration in patients with clearly positive results by 4 of the other tests (≤2.3 μg/g stool).
Although the threshold for positivity is also an issue when the cutpoint for positivity is to be defined for quantitative immunochemical FOBTs (or for any other quantitative tests),12, 13 the latter offer the possibility of flexible adaptation of the threshold for positivity to best meet the requirements in a specific screening setting.4, 8, 10, 14–18 Furthermore, automated reading of results reduces the variation and uncertainty that come with subjective judgment. These apparent benefits come at the price of the need for specialized laboratory equipment and processing. However, once these conditions are met and once an objective cutpoint has been defined, the analyst is not left with the need for subjective classification of borderline results as positive or negative.
Taken together, the results of our study suggest that differences between immunochromatographical tests essentially reflect different thresholds of positivity. The selection of such thresholds has important practical implications. Manufacturers should aim for adapting the thresholds of positivity of their tests and their instructions for how to interpret borderline results to achieve levels of sensitivity and specificity that best meet the requirements for population-based screening. Physicians and patients, for whom the information provided by the manufacturers is often the main source of information, should be aware of the current very strong variation in test characteristics across tests, including the strongly divergent meaning of a borderline positive result. Using current thresholds, borderline results should be rated negative rather than positive for several currently available tests, and further adjustment of cutoffs seems warranted. Given the strong variation in positivity rates, sensitivity and specificity of the various qualitative immunochromatographical tests, detailed validation studies in the screening setting, with a particular focus on the interpretation of borderline results, should precede their implementation and be continued thereafter. The latter applies to European countries including Germany in particular, where chemical modification of diagnostics is, in principle, possible without notice.
Our study has several limitations. Because of the low number of CRC cases, diagnostic performance could not be assessed separately with reasonable precision for this endpoint. Rather, CRC was combined with advanced adenomas into a common endpoint, advanced neoplasms. Sample size limitations also prohibited more detailed analyses according to location of neoplasms, which might affect test performance.14, 19 Screening colonoscopy was considered as the gold standard for estimating sensitivity and specificity of the various tests. Although colonoscopy is probably the most reliable method for the detection of colorectal neoplasms, it is not perfect, and some proportion of adenomas is likely to have been missed,20 despite the high levels of qualification required for admission to screening colonoscopy in Germany. Raters were not blinded as to the particular tests, even though blinding with respect to colonoscopy findings was used. Our study differed from real-life conditions in that stool samples were not directly dissolved in a buffer-filled vial and stored in this buffer until analysis. A potential impact of the latter on test positivity rates (potentially resulting from further degradation of hemoglobin) could thus not be assessed. A potential shift of the positivity rate, going along with divergent effects on sensitivity and specificity, might, however, have affected the different tests in a similar manner, without necessarily affecting inter-test agreement.
Despite its limitations, our study demonstrates that the strongly divergent positivity rates of 6 different immunochemical FOBTs most likely reflect, to a large extent, their different thresholds for detection of hemoglobin in stool. Apart from these differences, agreement between tests was very high. Definition of cutoffs is a critical issue in the application of immunochromatographical tests, and cutoffs should be redefined for several of the tests to limit false-positive rates in population-based screening.
The authors gratefully acknowledge excellent cooperation of physicians conducting screening colonoscopies in patient recruitment and excellent contributions of Isabel Lerch in data collection, monitoring and documentation. The test kits were provided free of charge by the manufacturers. The quantitative immunochemical tests were run free of charge by Labor Limbach, Heidelberg.
- 1. Screening and surveillance for the early detection of colorectal cancer and adenomatous polyps, 2008: a joint guideline from the American Cancer Society, the US Multi-Society Task Force on Colorectal Cancer, and the American College of Radiology. Gastroenterology 2008; 134: 1570–95.
- 11. http://www.r-biopharm.com/product_site.php?product_id=39&, Accessed October 26, 2009.
- 13. Noninvasive testing for colorectal cancer: a review. Am J Gastroenterol 2005; 100: 1393–403.
- 18. Identification of colorectal adenomas by a quantitative immunochemical faecal occult blood screening test depends on adenoma characteristics, development threshold used and number of tests performed. Aliment Pharmacol Ther 2009; 29: 906–17.
- 20. Polyp miss rate determined by tandem colonoscopy: a systematic review. Am J Gastroenterol 2006; 101: 343–50.