Objectives To assess the diagnostic accuracy of the H2S test for microbiological contamination of domestic water across different settings, as a basis for providing guidance on its use.
Methods We searched a range of bibliographic and ‘grey’ literature databases to identify studies that had processed domestic water samples using both the H2S test and recognized tests for thermotolerant coliforms or Escherichia coli. We screened 661 study abstracts and identified 51 relevant studies based on 13 853 water samples. For each relevant study, we recorded the level of correspondence between the H2S and recognized tests, microbial testing procedures, details of the samples processed and study quality indicators. We conducted a meta-analysis to investigate the impact of testing procedures, study quality and sample characteristics on the diagnostic accuracy of the H2S test.
Results H2S test implementation varied between studies, and the test’s diagnostic accuracy varied significantly and substantially between studies. Little of this variation was explained by testing procedures, study quality or the nature of the samples processed.
Conclusions Although the H2S test is in widespread use, our findings suggest that its diagnostic accuracy, particularly its specificity, is variable. Optimal conditions for conducting the test remain unclear. As H2S test accuracy is low in a minority of these studies, we recommend that its performance be evaluated relative to standard methods, prior to its operational deployment in a new setting.
Diarrhoeal diseases from inadequate water, sanitation and hygiene annually cause an estimated 2.2 million deaths globally and 76.3 million disability-adjusted life years (DALYs), 3.9% and 5.3% of the respective global totals (Prüss et al. 2002). Faecal contamination, the most widespread health risk associated with drinking water, is identified through microbiological testing. Standard tests require the collection of a sample and processing at well-resourced laboratories. Many developing countries lack such facilities, with limited resources, constrained budgets and poorly trained staff further hampering the ability to deliver accurate assessments.
A simple, low-cost presence/absence test for bacteria that produce hydrogen sulphide (H2S) has been proposed to assess faecal contamination (Manja et al. 1982). As originally proposed, H2S test medium contains 20 g of peptone, 1.5 g of dipotassium hydrogen phosphate, 0.75 g of ferric ammonium citrate, 1 g of sodium thiosulphate, 1 ml of teepol and 50 ml of water. After incubation for at least half a day, often under ambient conditions, a change to a black or grey colour indicates the presence of H2S-producing bacteria and presumed faecal contamination. H2S test kits are relatively easy to manufacture, so are often made locally, at lower cost than recognized methods (Genthe & Franck 1999). The test is used in many developing countries, in emergencies (Mosley & Sharp 2005) and in remote areas of developed countries (UNICEF 2007).
Previous studies have compared H2S test results with those for recognized methods to enumerate thermotolerant coliforms (TTC) or Escherichia coli (EC). An earlier non-systematic review examined this literature (Sobsey & Pfaender 2003), identifying false-positive results from groundwater samples as a particular concern and recommending that the H2S test be used with caution, only where alternative tests are infeasible. This study aims to assess variation in diagnostic accuracy of the H2S test in drinking or domestic water across different settings based on a systematic review and meta-analysis.
Search strategy and selection criteria
We conducted a systematic review and meta-analysis, the protocol for which is available from the corresponding author on request. We identified microbiological studies that simultaneously tested the same set of domestic water samples for faecal contamination using two different methods: the H2S test and a standard, recognized test for TTC or EC. The target condition being diagnosed was thus faecal contamination of domestic water, with the associated risk of water-borne disease.
We included studies of the H2S test where its implementation was low-cost (i.e. less than US$2/test) and capable of use onsite (i.e. not requiring permanent laboratory facilities). Included comparator tests for the TTC or EC indicator bacteria groups were recognized as an International Organisation for Standardisation standard, in the UK Blue Book methods (http://www.environment-agency.gov.uk/research/commercial/32874.aspx), in the American Water Works Association Standard Methods (Eaton et al. 2005), USEPA (http://www.epa.gov/fem/methcollectns.htm) or in equivalent country-specific standards at the time of the study. We included samples from stored water and supply systems for both households and public places such as schools and clinics. We also included surface water samples from streams, rivers and lakes where used for drinking.
The search strategy for bibliographic databases combined terms for the H2S test (e.g. ‘H2S’, or ‘hydrogen sulphide’, or ‘hydrogen sulfide’, or ‘pathoscreen’, or ‘Manja’) with terms for domestic water samples (e.g. ‘water’, or ‘environmental samples’) and terms for comparator test indicator bacteria groups (e.g. ‘thermotolerant’, or ‘faecal’, or ‘fecal’, or ‘coliform’, or ‘E. coli’). These terms were used in English and Chinese with 21 bibliographic databases and 12 sources of grey literature (online-only supplementary materials, Table S1).
Bibliographic and grey literature searches covered the period until 15 February and 2 March 2010, depending on the source. References to and from included studies, a relevant review (Sobsey & Pfaender 2003), and to Manja's original study (Manja et al. 1982) were also reviewed. Where the full text of a potentially relevant paper was unavailable, we contacted one or more of the authors. Studies identified by searches were independently assessed for inclusion by two members of the research team (JAW, HY), who then independently recorded characteristics of included studies. They also assessed study quality against 14 criteria commonly used for clinical diagnostics (Whiting et al. 2003). Arithmetic errors in each included study report were also recorded. Disagreements were resolved by consensus or referred to other team members (SP, JE).
We used the 1 cfu/100 ml threshold to define contamination with TTC or EC wherever possible, as this reflects World Health Organization guideline values (WHO 2008). Where necessary, we recorded individual sample data and calculated numbers of concordant and discordant samples.
Meta-analyses were based on sensitivity and specificity, determined using numbers of concordant and discordant samples. Univariable meta-regression was applied within the bivariate random effects framework using the Stata midas routine (Arends et al. 2008), to identify characteristics of studies, samples and test procedures associated with test performance.
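As an illustration of the quantities underlying the meta-analysis, the sketch below derives sensitivity and specificity from one study's concordant and discordant sample counts. The function name and the example counts are hypothetical and are not taken from any included study; the bivariate random-effects model fitted with the Stata midas routine is not reproduced here.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Return (sensitivity, specificity) of the H2S test against a reference test.

    tp: H2S positive, reference positive (concordant)
    fn: H2S negative, reference positive (discordant)
    fp: H2S positive, reference negative (discordant)
    tn: H2S negative, reference negative (concordant)
    """
    if tp + fn == 0 or fp + tn == 0:
        # Both measures cannot be computed unless the study contains at least
        # one contaminated and one uncontaminated sample.
        raise ValueError("need at least one reference-positive and one reference-negative sample")
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    return sensitivity, specificity


# Hypothetical counts for a single study (illustrative only)
sens, spec = sensitivity_specificity(tp=80, fn=20, fp=15, tn=85)
```

The guard clause reflects the practical constraint that a study with no reference-positive or no reference-negative samples cannot contribute both measures.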
For 21 included studies that implemented multiple H2S test procedures (e.g. evaluating the test under different incubation conditions or media) and compared these with standard methods, we selected the most widely used H2S procedure across all included studies. We undertook a sensitivity analysis to test the robustness of our findings to this strategy for choosing comparison groups. To investigate publication bias, we looked for asymmetry on a funnel plot of the diagnostic odds ratio. We also performed a linear regression of log odds ratios on the inverse root of the effective sample size to test for funnel plot asymmetry (Deeks et al. 2005).
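The funnel plot asymmetry regression described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in this study: the function name is hypothetical, the effective sample size is taken as 4·n1·n2/(n1 + n2) with weights equal to that quantity, and a continuity correction of 0.5 is added to every cell so that the odds ratio is always defined.

```python
import math


def deeks_regression(studies):
    """Weighted regression of ln(DOR) on 1/sqrt(ESS), with weights = ESS.

    studies: list of (tp, fn, fp, tn) tuples, each with at least one
    reference-positive and one reference-negative sample.
    Returns (intercept, slope); a slope far from zero suggests asymmetry.
    """
    xs, ys, ws = [], [], []
    for tp, fn, fp, tn in studies:
        # Assumption: 0.5 continuity correction applied to all four cells.
        a, b, c, d = (v + 0.5 for v in (tp, fn, fp, tn))
        n_pos, n_neg = tp + fn, fp + tn
        ess = 4 * n_pos * n_neg / (n_pos + n_neg)  # effective sample size
        xs.append(1 / math.sqrt(ess))
        ys.append(math.log((a * d) / (b * c)))     # log diagnostic odds ratio
        ws.append(ess)
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical 2x2 counts for three studies (illustrative only)
intercept, slope = deeks_regression([(80, 20, 20, 80), (30, 10, 5, 35), (12, 8, 6, 14)])
```

A formal test would compare the slope with its standard error; that step is omitted here to keep the sketch short.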
As the search strategy and study selection results given in Figure 1 show, 51 studies were included in the systematic review from 58 reports, whose characteristics are shown in Table 1. These studies reported a total of 13 853 water samples tested using both the H2S method and standard assays for TTC or EC. The 51 studies comprised 3140 groundwater samples; 845 surface or rainwater samples; 4404 distribution system, tap or standpipe samples; and 4111 samples of water stored in the home or public places. The source of a further 1353 samples was unclear, and 4106 samples had been treated. Twenty-nine studies used TTC, of which 11 used 1 cfu/100 ml to define contamination. Twenty-three studies used EC as the reference test, of which 12 defined water contamination using a threshold of 1 cfu/100 ml.
Table 1. Characteristics of the 51 studies included in the systematic review (see Appendix S1 for references)
Column headings: H2S test procedures; dominant sample type; incubation period (h); incubation temperature (°C); geometric mean bacteria density (cfu/100 ml); threshold density (cfu/100 ml); number of reporting errors.
Dominant water source abbreviations: DST, distribution system/tap; GW, ground water; T, treated; POU, point-of-use; SRW, surface/rain water; abbreviation for the H2S test media formulation: M1982, Manja et al. (1982); M2001, Manja et al. (2001); M1982_1, Manja et al. (1982) + ‘6-fold liquid’ + lauryl sulphate, sodium salt; M1982_2, Manja et al. (1982) + l-cystine; M1982_3, Manja et al. (1982), replacing teepol with bile salt; M1982_4, modified Manja et al. (1982), but without media details; V1994, Venkobachar et al. (1994); Hiselective, Hiselective H2S kit; TTC, thermotolerant coliforms; Threshold, threshold density used to define contaminated sample.
†Hewison et al. (1988) reported on two separate studies.
‡Raka et al. (1999) report a main study and subsequent follow-up work using different H2S test media, separated here as two studies.
The implementation of the H2S test varied markedly between studies, with a wide range of incubation temperatures and periods used (Table 1). Most studies used locally manufactured media, rather than using commercially available test kits. Water sample volumes were largely 20 ml, although four studies used 10 ml and nine studies used 50 or 100 ml. Thirty-two studies performed the H2S test procedure in the laboratory with skilled staff, although the test is often intended for field use by staff with only basic training. H2S tests were performed in a variety of ways. Twenty-three studies used the original media proposed by Manja et al. (1982), three modified the original media by adding l-cystine, two modified the original media by replacing Teepol with bile salt or with lauryl sulphate (sodium salt), whilst two used modified media suggested in Manja et al. (2001).
We were unable to determine numbers of concordant and discordant test results in 9 of these 51 studies. This was because we lacked full text reports for two studies, numbers of concordant and discordant samples were not provided in the full text for six studies, and one further study was still ongoing. Three studies contained either no reference-negative samples (no false positives or true negatives) or no reference-positive samples (no false negatives or true positives), so we were unable to calculate both sensitivity and specificity for them. There were a further three studies for which the numbers of concordant and discordant samples were described with a small degree of ambiguity in the original text. After excluding studies where we lacked numbers of concordant and discordant samples, this left 8279 samples available for meta-analysis.
We identified one study for which a preliminary report was available, but from which full results were not yet published (Khush et al. 2009). There were also 23 potentially relevant studies for which we were unable to obtain full texts, of which 4 were characterized based on abstracts and citations. Of the remainder, we did not obtain full texts because either these were oral presentations with no accompanying manuscript (12 studies) or the original report was unobtainable (7 studies).
The scores for the 14 study quality criteria for each study are summarized in Table 2. There were some generic strengths (e.g. simultaneous sample collection for both tests) and weaknesses (e.g. possible influence of one test result on the reading of the other) common to almost all studies. Scores for the remaining study quality items varied between studies. The test for funnel plot asymmetry suggested a low likelihood of publication bias (P = 0.29 for TTC; P = 0.13 for EC).
Table 2. Percentage of the 51 included studies meeting 14 study quality items, derived from a protocol used for studies of clinical diagnostics (Whiting et al. 2003)
Were water samples drawn from areas where the H2S test would typically be deployed?
Do the authors describe how water sampling points were chosen?
Were quality control procedures described for thermotolerant coliforms (TTC) or Escherichia coli testing?
Were water samples processed simultaneously using the H2S test and the TTC or E. coli test?
Were all water samples (or a random subset) tested for both H2S and TTC or E. coli?
Did the TTC or E. coli test remain the same, regardless of the H2S test result?
Did the H2S test not form any part of the TTC or E. coli test procedure?
Were H2S test procedures described in sufficient detail to permit test replication?
Were TTC or E. coli test procedures described in sufficient detail to permit test replication?
Were the H2S test results interpreted without knowledge of the TTC or E. coli test results?
Were the TTC or E. coli samples interpreted without knowledge of the H2S test results?
Were the same sample details available during test interpretation as would be available in practice?
Were H2S samples that changed colour slightly (e.g. to light grey not black) reported?
Were water samples for which no results were presented explained?
Forest plots of H2S test sensitivity and specificity for TTC and EC are shown in Figure 2. There was substantial heterogeneity in sensitivity for both TTC (I² = 89.1% [85.7–92.5]) and EC (I² = 93.6% [91.6–95.7]). Similarly, there was substantial heterogeneity in specificity for both TTC (I² = 96.4% [95.6–97.2]) and EC (I² = 95.4% [94.1–96.8]), with TTC specificity being more heterogeneous than TTC sensitivity.
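For readers unfamiliar with the I² statistic, the sketch below shows one common way of computing it for a set of study proportions (for example, per-study sensitivities): Cochran's Q is calculated from inverse-variance weighted logit proportions, and I² is the share of Q in excess of its degrees of freedom. The function name, the logit scale and the continuity correction are assumptions made for illustration; the values reported above were not necessarily derived in exactly this way.

```python
import math


def i_squared(events, totals):
    """Return I^2 (as a percentage) for study proportions events[i] / totals[i]."""
    thetas, weights = [], []
    for e, n in zip(events, totals):
        # Assumption: continuity correction so that 0% or 100% stays finite.
        e_c, non_e_c = e + 0.5, (n - e) + 0.5
        thetas.append(math.log(e_c / non_e_c))          # logit proportion
        weights.append(1 / (1 / e_c + 1 / non_e_c))     # inverse variance
    pooled = sum(w * t for w, t in zip(weights, thetas)) / sum(weights)
    q = sum(w * (t - pooled) ** 2 for w, t in zip(weights, thetas))
    df = len(thetas) - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100


# Hypothetical true-positive counts and reference-positive totals for four studies
i2 = i_squared(events=[90, 50, 70, 30], totals=[100, 100, 100, 100])
```

Values near 0% indicate that between-study differences are compatible with sampling error alone, whereas values close to 100%, as reported here, indicate that most of the observed variation reflects genuine differences between studies.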
Figure 3 shows the test’s positive predictive value (proportion of all H2S positive samples that were true positives) relative to the percentage of samples with detectable TTC or EC. Unsurprisingly, positive predictive value was lower when few samples had detectable TTC or EC.
Meta-regression findings were inconsistent between TTC and EC and fewer explanatory variables were associated with specificity than sensitivity (Figure 4). Although some study quality items were associated with sensitivity for both TTC and EC, it was difficult to discern any consistent effect of study quality on diagnostic accuracy. Focussing on findings significant at the 1% level, lower sensitivity with TTC was associated with smaller H2S sample volumes, shorter incubation periods and higher incubation temperatures (Figure 4a) and with sampling strategies that focussed on remote settings and were well documented. Variation in specificity with TTC was not related to the variables presented in Figure 4b. Sensitivity with EC (Figure 4c) was lower for studies conducted on the Indian subcontinent, in studies predominantly based on groundwater samples and where microbiological procedures were described in detail. Specificity with EC was higher for studies from the Indian subcontinent (Figure 4d).
In contrast to a previous literature review (Sobsey & Pfaender 2003) of 13 studies, this quantitative analysis systematically synthesises evidence on H2S test diagnostic accuracy from 51 studies. Our study used meta-analysis to estimate heterogeneity in diagnostic accuracy and its causes. To our knowledge, meta-analysis of diagnostic accuracy has to date been applied to clinical diagnostics, rather than to environmental diagnostics as in this study.
Our review suggests that there are now many studies of H2S test diagnostic accuracy, although implementation varies between studies. Many of the studies analysed took place in India and Nepal, perhaps reflecting widespread H2S test use there. There was, however, substantial heterogeneity in both sensitivity and specificity relative to both TTC and EC. As little of this heterogeneity was explained by the H2S procedures used, study quality or the types of sample processed, this unexplained variation could result from unmeasured differences in test procedures (e.g. the experience of test operators) or sample characteristics (e.g. variations in microbial ecology).
In most studies reviewed here, H2S test performance would be rated as moderately to highly accurate based on area under the receiver operating characteristic (ROC) curve (Swets 1988), although performance varied widely between studies and was sometimes poor. In almost all studies, its performance is poorer than that of recognized microbiological tests. For example, Colilert achieved 0.94 for both sensitivity and specificity compared with those of existing standard methods of the time (Edberg et al. 1988), whilst Colitag and Colisure achieved sensitivities of 0.96 and 0.98 and specificities of 0.93 and 0.96, respectively (NEMI 2009; McFeters et al. 1995). The H2S test's current status as a non-standard test for use in remote, resource-poor settings and for educational purposes seems consistent with its performance.
Meta-regression findings were inconsistent between TTC and EC, and fewer explanatory variables were associated with specificity than sensitivity. The greater sensitivity with TTC observed with higher H2S sample volumes and longer incubation times is consistent with previous research (Gupta et al. 2008). However, lower sensitivity with EC was observed in studies based on predominantly groundwater samples. This differs from the anticipated effect in groundwater, namely a decrease in specificity owing to sulphide formation by non-faecal bacteria in anaerobic aquifers (Sobsey & Pfaender 2003). Sensitivity with EC was lower and specificity higher in studies from the Indian subcontinent, where H2S use is widespread. The lack of any strong association between test performance and bacterial contamination levels is surprising, given some previous findings (Gupta et al. 2008).
As is well recognized in clinical diagnostics (Grimes & Schulz 2002), the use of even a relatively accurate test may be problematic in settings where water quality is generally good. The proportion of positive H2S test results that reflect genuine contamination (the test's positive predictive value) depends on how often water source contamination occurs as well as on test accuracy; the remaining positive results are ‘false alarms’. Where contaminated sources are commonplace, positive H2S results frequently reflect genuine contamination (Figure 3). Where contaminated sources are rare, the small number of true-positive H2S results is often outnumbered by false positives, producing many more ‘false alarms’. A false-positive test result can potentially have undesirable consequences, such as causing people to boil their drinking water unnecessarily or switch to more distant water sources. The benefit of identifying a small group of contaminated sources through field testing may therefore be offset by unnecessary concern and action as a result of false positives from a larger group of water sources. There is thus a weaker rationale for using a moderately accurate test like H2S where water contamination is seldom encountered.
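The dependence of positive predictive value on how often contamination occurs follows from Bayes' theorem and can be sketched as below. The function name and the sensitivity, specificity and prevalence values are hypothetical, chosen only to illustrate the argument; they are not pooled estimates from this review.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Proportion of positive test results that are true positives.

    prevalence: proportion of samples that are truly contaminated (0 to 1).
    """
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    if true_pos + false_pos == 0:
        raise ValueError("no positive results are expected with these inputs")
    return true_pos / (true_pos + false_pos)


# Hypothetical test with sensitivity and specificity both equal to 0.85
ppv_common = positive_predictive_value(0.85, 0.85, prevalence=0.50)
ppv_rare = positive_predictive_value(0.85, 0.85, prevalence=0.05)
```

With these hypothetical values, the positive predictive value is 0.85 when half of the samples are contaminated but falls to about 0.23 when only 5% are, so roughly three in four positive results would then be false alarms.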
As H2S test accuracy is low in a minority of these studies, we recommend its performance be evaluated relative to standard methods, prior to its operational deployment in a new setting.
Four issues could have affected our conclusions.
First, we assumed that recognized microbiological methods are equivalent to one another. In reality, such methods may vary both in their ability to recover indicator bacteria injured through water treatment and in their precision. More generally, TTC and EC have their weaknesses as indicator bacteria (Gleeson & Gray 1997), and there is no definitive means of identifying faecally contaminated drinking water (Sobsey & Pfaender 2003).
Second, we searched for relevant literature in English and Chinese, but may have missed reports in other languages. We were unable to obtain the full text of some potentially relevant reports. This may have influenced our findings if test accuracy was systematically different in these reports.
Third, included studies typically had skilled individuals conduct H2S testing in the laboratory. Whilst we found no systematic differences in accuracy between field- and laboratory-based studies, these studies may collectively overestimate the diagnostic accuracy achieved in routine field use as a consequence. In particular, they do not evaluate the effectiveness of any associated training in water testing.
Fourth, in studies with multiple comparison groups based on differing H2S procedures, we included only one group in our analysis as described earlier. This may have reduced our ability to discern an effect of H2S procedures on accuracy.
Aside from analysing groups of water samples, it may be possible to gain further insights by analysing individual water sample results, particularly in relation to differing indicator bacteria concentrations. More generally, we used indirect comparison to examine different H2S procedures between studies, but ignored any direct comparison of H2S procedures within studies. In clinical meta-analysis, techniques are developing to synthesize evidence from both direct comparisons within studies and indirect evidence across studies (Caldwell et al. 2005). Were these techniques to be extended to handle studies of diagnostic accuracy, there would be value in applying them to the H2S test literature.
The H2S test remains substantially cheaper than the alternatives. Consumables costs for a 100 ml presence/absence H2S test have been estimated as $0.35 per test, compared with over $1.50 per test for commercially manufactured alternatives (Chuang et al. 2011). A Chilean study estimated annual monitoring costs as $84.8 using H2S and field incubation, but $211.2 using laboratory-based membrane filtration (Castillo 1997), with savings on both transportation and testing procedures. This lower cost makes the test affordable where conventional methods are too expensive, but also means a larger number of H2S tests could be conducted for the same cost as a smaller number of conventional tests. Repeat presence/absence tests using H2S may be particularly valuable in supply systems where intermittent contamination events occur, such as those documented following pressure drops in piped systems.
Although not equivalent to laboratory testing, the H2S test remains an alternative in remote and resource-poor settings because of its low cost and ease of use. The lower cost of H2S also makes more frequent testing feasible, which may be particularly valuable in supply systems where intermittent contamination occurs. However, given the variable accuracy observed in studies reviewed here, we recommend benchmarking the test against standard methods, prior to its operational deployment in a new setting. We also recommend exercising caution when using the test where contaminated supplies are seldom encountered because false positives may outnumber true positives in this situation.
This research has been funded by the Bill & Melinda Gates Foundation, under a grant to the University of Bristol, whose purpose is to develop a low-cost test of water quality for use in developing countries. We thank Dr Patricia Lucas of the School of Policy Studies, University of Bristol for reviewing an early version of the protocol for this study and Prof. Paul Hunter of the University of East Anglia for providing overall comments on a version of this paper.