Towards more accurate HIV testing in sub-Saharan Africa: a multi-site evaluation of HIV RDTs and risk factors for false positives

Abstract
Introduction: Although individual HIV rapid diagnostic tests (RDTs) show good performance in evaluations conducted by WHO, reports from several African countries highlight potentially significant performance issues. Despite widespread use of RDTs for HIV diagnosis in resource-constrained settings, there has been no systematic, head-to-head evaluation of their accuracy with specimens from diverse settings across sub-Saharan Africa. We conducted a standardized, centralized evaluation of eight HIV RDTs and two simple confirmatory assays at a WHO collaborating centre for evaluation of HIV diagnostics using specimens from six sites in five sub-Saharan African countries.
Methods: Specimens were transported to the Institute of Tropical Medicine (ITM), Antwerp, Belgium for testing. The tests were evaluated by comparing their results to a state-of-the-art reference algorithm to estimate sensitivity, specificity and predictive values.
Results: 2785 samples collected from August 2011 to January 2015 were tested at ITM. All RDTs showed very high sensitivity, from 98.8% for First Response HIV Card Test 1–2.0 to 100% for Determine HIV 1/2, Genie Fast, SD Bioline HIV 1/2 3.0 and INSTI HIV-1/HIV-2 Antibody Test kit. Specificity ranged from 90.4% for First Response to 99.7% for HIV 1/2 STAT-PAK, with wide variation based on the geographical origin of specimens. Multivariate analysis showed several factors were associated with false-positive results, including gender, provider-initiated testing and the geographical origin of specimens. For simple confirmatory assays, total sensitivity and specificity were 100% and 98.8% for ImmunoComb II HIV 1&2 CombFirm (ImmunoComb) and 99.7% and 98.4% for Geenius HIV 1/2, with indeterminate rates of 8.9% and 9.4%, respectively.
Conclusions: In this first systematic head-to-head evaluation of the most widely used RDTs, individual RDTs performed more poorly than in the WHO evaluations: only one test met the recommended thresholds for RDTs of ≥99% sensitivity and ≥98% specificity. By performing all tests in a centralized setting, we show that these differences in performance cannot be attributed to study procedure, end-user variation, storage conditions, or other methodological factors. These results highlight the existence of geographical and population differences in individual HIV RDT performance and underscore the challenges of designing locally validated algorithms that meet the latest WHO-recommended thresholds.


Introduction
HIV rapid diagnostic tests (RDTs) are the main diagnostic tool for HIV screening and diagnosis in resource-constrained settings [1]. Simple and fast, they require little or no equipment, and provide results usually within 20 min. Most RDTs involve very few manipulation steps, can be read visually, and can be stored at ambient temperature. At a price per test of US$ 1-2, RDTs are ideal for use in settings without the infrastructure or expertise to support the use of more complex techniques.
Given the potential for severe psychological and social impacts of HIV misdiagnosis, it is imperative that HIV diagnosis is highly sensitive and specific. HIV misdiagnosis has been a problem in some Médecins Sans Frontières (MSF) programmes in sub-Saharan Africa where HIV care is provided in partnership with local Ministries of Health [2,3]. In addition to the psychological trauma a misdiagnosis can induce in the individual patient, who may inappropriately have been initiated on treatment that is both costly and potentially harmful, there is also the considerable programmatic impact of false positives, which siphon off scarce resources and may undermine client-patient confidence in the testing [4,5].
World Health Organization (WHO) guidelines for HIV testing and counselling recommend an algorithm consisting of 2-3 RDTs chosen on the basis of their performance (clinical sensitivity ≥99% and clinical specificity ≥98% for the first-line assay, and ≥99% for the second-line assay), operational characteristics and local evaluation results, among other factors [1].
Despite the continuing widespread use of RDTs for HIV diagnosis in resource-constrained settings, there has been no systematic, head-to-head evaluation of their accuracy with specimens from diverse settings across sub-Saharan Africa.
We report here the results of a standardized, centralized evaluation of eight HIV RDTs and two simple confirmatory assays at a WHO collaborating centre for evaluation of HIV diagnostics using specimens collected from six sites in five sub-Saharan African countries. Algorithms will be elucidated and discussed in a separate publication.

Study setting
This study was carried out at six public health care clinics and hospitals in sub-Saharan Africa where Médecins Sans Frontières (MSF) supports health care activities: (1) Centre Communautaire Matam in Conakry, Guinea, (2) Madi Opei Clinic and Kitgum Matidi Clinic in Kitgum, Uganda, (3) Homa Bay District Hospital in Homa Bay, Kenya, (4) Arua District Hospital in Arua, Uganda, (5) Nylon Hospital in Douala, Cameroon and (6) Baraka Hospital in Baraka, South-Kivu, DRC. The six sites were selected from among MSF-supported HIV testing and counselling (HTC) sites to represent geographical diversity and a range of characteristics (urban and rural, voluntary and provider-initiated testing, different HIV prevalence). The HIV national reference laboratory at the Institute of Tropical Medicine (ITM, Antwerp, Belgium) served as the central laboratory for this study.

Study design and sample size
This was a multi-centre evaluation of the diagnostic accuracy of eight individual HIV RDTs and two simple HIV confirmatory assays on the following measures: sensitivity, specificity and predictive values.
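Unlike sensitivity and specificity, predictive values depend on the HIV prevalence among the clients tested. The sketch below illustrates this with Bayes' theorem. The function name and the input values are illustrative assumptions and are not taken from the study data.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from Bayes' theorem; all inputs are proportions."""
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must be strictly between 0 and 1")
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


if __name__ == "__main__":
    # Hypothetical test with 100% sensitivity and 98% specificity,
    # applied at two hypothetical prevalence levels.
    for prevalence in (0.05, 0.40):
        ppv, npv = predictive_values(1.00, 0.98, prevalence)
        print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

With these hypothetical inputs, the positive predictive value is roughly 72% at 5% prevalence and roughly 97% at 40% prevalence, which is why the same test can produce very different proportions of false positives at different sites.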
At least 200 positive and 200 negative samples from study participants were collected for evaluation at each study site [20]. The sample size was calculated based on the assumption that both sensitivity and specificity must be 98% in order to provide a 95% confidence interval of less than ±2% for both sensitivity and specificity.
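The arithmetic behind this target can be sketched as follows. The text does not state which formula was used, so the normal-approximation formula for a single proportion, n = z²·p(1−p)/d², is an assumption here, as are the function and parameter names.

```python
import math
from statistics import NormalDist


def samples_for_precision(expected_proportion, half_width, confidence=0.95):
    """Samples needed so that the normal-approximation confidence interval
    around expected_proportion is no wider than +/- half_width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    n = z ** 2 * expected_proportion * (1 - expected_proportion) / half_width ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Expected sensitivity (or specificity) of 98%, precision of +/-2%.
    print(samples_for_precision(0.98, 0.02))
```

Under these assumptions the calculation gives just under 190 samples per group, consistent with a rounded target of at least 200 positive and 200 negative samples per site.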
The prevalence of HIV-positive results among the clients tested at each study site was known. If it was ≥40%, we collected all specimens consecutively. We calculated the total sample size from the prevalence so as to obtain at least 200 HIV-positive and 200 HIV-negative samples, and increased the calculated sample size by 10% to account for losses and/or problems in shipment.
If the prevalence of positive results was below 40%, we obtained a subsample of positive and negative specimens. Conservatively assuming 10% misclassification, we collected a sub-sample of 220 positive and 220 negative samples based on the on-site algorithm result. All samples with an inconclusive result were included. For this sampling strategy, we first included consecutively all clients, regardless of their results. Once the sample size for negative clients was reached, we stopped including HIV-negative clients (based on their on-site results) and included all clients diagnosed as HIV positive or inconclusive, for example, RDT1 positive and RDT2 negative, based on the on-site algorithm.
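For sites with a prevalence of ≥40%, the enrolment target described above can be sketched as follows. The exact calculation is not given in the text, so the formula (enrol enough consecutive clients to expect at least 200 in the rarer group, then add 10%) and all names are assumptions for illustration.

```python
import math


def consecutive_enrolment_target(prevalence, n_positive=200, n_negative=200,
                                 loss_margin=0.10):
    """Number of consecutive clients to enrol so that the expected counts of
    positives and negatives both reach their targets, inflated for losses."""
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must be strictly between 0 and 1")
    needed_for_positives = n_positive / prevalence
    needed_for_negatives = n_negative / (1 - prevalence)
    base = max(needed_for_positives, needed_for_negatives)
    # Round before taking the ceiling to avoid floating-point overshoot.
    return math.ceil(round(base * (1 + loss_margin), 6))


if __name__ == "__main__":
    for prevalence in (0.40, 0.50, 0.60):
        print(prevalence, consecutive_enrolment_target(prevalence))
```

The same 10% margin applied to a fixed quota of 200 gives the 220 positive and 220 negative samples used at sites with a prevalence below 40%.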

Study population
Clients ≥5 years of age who attended any of the participating HIV testing and counselling (HTC) centres and for whom written informed consent was provided by the client or legal guardian were included in the study. Upon enrolment, clients were offered HTC in accordance with site-specific procedures and testing algorithms. Exclusion criteria were: withdrawal of consent; inability to obtain a venous blood sample or insufficient blood; and current or past enrolment on anti-retroviral treatment.

Sample collection, storage and transportation
Venous EDTA blood was collected by the study nurse or laboratory technician. The EDTA blood samples were centrifuged, aliquoted and stored at −20°C until being transported at 2-8°C to the central laboratory (ITM) in Belgium. The storage temperature of freezers was monitored daily and a temperature recording system was used during transportation. At ITM, samples were immediately tested using the reference algorithm and remaining plasma samples were aliquoted further and stored at −20°C until testing of RDTs.

Reference method for HIV diagnosis
Clients' status was determined by using the reference standard algorithm at the AIDS reference laboratory at ITM, Antwerp, Belgium (Figure 1) on collected plasma samples. All samples were tested by a fourth generation ELISA (Vironostika® HIV Uni-Form II Ag/Ab, bioMérieux, France) and all reactive samples were confirmed by a Line-Immunoassay (LIA, i.e. INNO-LIA™ HIV I/II Score, Innogenetics NV, Ghent, Belgium). Samples with a negative or indeterminate LIA were tested with an antigen enzyme immunoassay (Ag-EIA, i.e. INNOTEST HIV Antigen mAb, Innogenetics NV, Ghent, Belgium) to confirm acute infections. In the event that the LIA could not differentiate between HIV-1 and HIV-2, we used an in-house DNA PCR.

HIV RDT
The following eight HIV RDTs were tested at ITM on all plasma samples collected from the six study sites: Determine HIV 1/2, Uni-Gold, Vikia, HIV 1/2 STAT-PAK, Genie Fast, SD Bioline HIV 1/2 3.0, INSTI HIV-1/HIV-2 Antibody Test kit and First Response HIV Card Test 1–2.0.
All but one of the RDTs are prequalified by the WHO [21]; the exception, Genie Fast, has been submitted for prequalification [22].
Tests were performed and interpreted according to the manufacturer's instructions. An additional analysis was performed with the ImmunoComb using an alternate interpretation based on the strict criteria used in an earlier evaluation (Figure 2) [2].
All tests were read by two laboratory technicians who were blinded to each other's interpretation and to the client's HIV status. If the two readers disagreed, a third reader acted as tie-breaker. Band intensity was recorded by the two readers and graded from 1 to 3 (1 = weak line, 2 = medium strength line, 3 = strong line).

Statistical analysis
Data were analyzed using Stata version 13.1 (StataCorp, College Station, Texas, USA).
We estimated the sensitivity, specificity and predictive value for each RDT and simple confirmatory assay by comparing the results of these tests performed at ITM to the results of the reference standard. The analysis was weighted to adjust for the sampling strategy, which underrepresented negative samples. For each participant, the weight was calculated as the inverse of the probability of inclusion in the study. For the total adjusted estimates, the weights were normalized to ensure equal representation of each site. Weighted proportions (e.g. weighted proportion of RDT reactive among all true positives by the reference standard for sensitivity) were calculated using the svy survey prefix command in Stata.
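The effect of the weighting can be illustrated with a minimal sketch in which each participant is weighted by the inverse of their probability of inclusion. This is not the Stata survey analysis used in the study: it omits the normalization across sites and any variance estimation, and the record layout and field names are hypothetical.

```python
def weighted_proportion(flags, weights):
    """Weighted share of records whose flag is True."""
    total = sum(weights)
    if total == 0:
        raise ValueError("no records to summarise")
    return sum(w for flag, w in zip(flags, weights) if flag) / total


def weighted_accuracy(records):
    """Return (sensitivity, specificity) using inverse-probability weights.

    Each record is a dict with hypothetical keys:
      'reference'   : True if HIV positive by the reference standard
      'rdt'         : True if the RDT under evaluation was reactive
      'p_inclusion' : probability that this participant was sampled
    """
    positives = [r for r in records if r["reference"]]
    negatives = [r for r in records if not r["reference"]]
    sensitivity = weighted_proportion(
        [r["rdt"] for r in positives],
        [1 / r["p_inclusion"] for r in positives],
    )
    specificity = weighted_proportion(
        [not r["rdt"] for r in negatives],
        [1 / r["p_inclusion"] for r in negatives],
    )
    return sensitivity, specificity
```

Because inclusion depended on the on-site result rather than on the reference result, reference-negative clients who tested positive or inconclusive on site (always included) carry less weight than those who tested negative on site (subsampled). Crude specificity therefore tends to be lower than the weighted estimate.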
To measure inter-reader reliability, the level of concordance between results reported by the two laboratory technicians independently reading each test was evaluated using the kappa coefficient. A kappa value ≥0.80 was considered very good agreement.
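The agreement statistic can be sketched as follows, assuming the kappa coefficient referred to is Cohen's kappa for two readers (the text does not name the variant). The function name and the example readings are hypothetical.

```python
from collections import Counter


def cohens_kappa(reader_a, reader_b):
    """Cohen's kappa for two readers rating the same test strips."""
    if len(reader_a) != len(reader_b) or not reader_a:
        raise ValueError("both readers must rate the same non-empty set")
    n = len(reader_a)
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    freq_a = Counter(reader_a)
    freq_b = Counter(reader_b)
    # Agreement expected by chance from each reader's marginal frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    if expected == 1:
        return 1.0  # both readers gave one identical category throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # 100 hypothetical strips: the readers disagree on two of them.
    a = ["R"] * 50 + ["N"] * 50
    b = ["R"] * 49 + ["N"] + ["N"] * 49 + ["R"]
    print(round(cohens_kappa(a, b), 2))
```

In this hypothetical example the readers agree on 98 of 100 strips, and kappa is 0.96 once chance agreement is subtracted.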
For each rapid test, factors associated with false positivity were analyzed using logistic regression with age, gender, inclusion site, entry mode and comorbidity included as covariates.
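As a simpler, unadjusted stand-in for the multivariable model, the sketch below computes a crude odds ratio for a false-positive result from a 2×2 table, with a Woolf (log-based) 95% confidence interval. It does not adjust for the other covariates as the logistic regression does, and the counts and names are hypothetical.

```python
import math


def crude_odds_ratio(exposed_fp, exposed_tn, unexposed_fp, unexposed_tn, z=1.96):
    """Crude odds ratio of a false-positive result among reference-negative
    clients, comparing an exposed group with an unexposed group.

    fp = false-positive RDT results, tn = true-negative RDT results.
    Returns (odds_ratio, ci_lower, ci_upper); all four counts must be > 0.
    """
    counts = (exposed_fp, exposed_tn, unexposed_fp, unexposed_tn)
    if min(counts) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (exposed_fp * unexposed_tn) / (exposed_tn * unexposed_fp)
    se_log = math.sqrt(sum(1 / c for c in counts))
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts: 30 of 500 men and 20 of 800 women false positive.
    print(crude_odds_ratio(30, 470, 20, 780))
```

An interval that excludes 1 suggests an association in the crude analysis; whether it persists after adjustment is what the multivariable model assesses.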

Ethics
The study was approved by the MSF Ethics Review Board and the Ethics Committees of the five countries where the study took place.

Results
Characteristics of the study population
From August 2011 to January 2015, a total of 2785 samples were collected at the six HTC sites and tested at the central laboratory (Table 1), with 437-500 samples collected per study site. Of the total 2785 samples, 1474 were found to be HIV negative and 1306 HIV positive (including one positive for HIV-2) by the reference algorithm (Figure 1). Three samples with indeterminate results and two classified as acute infections were excluded from the analysis.
Most study participants were female (61.9%). The median age of study participants was 30 years (IQR: 24-39). Most participants presented for testing at the HTC facility voluntarily, or were referred by their spouse, with variations among sites (Table 1).

Diagnostic accuracy of the HIV RDTs
Adjusted (weighted) sensitivities ranged from 96.2% to 100% with specimens from different study sites (Table 2). Adjusted sensitivities <99% were found for four tests (Uni-Gold, Vikia, STAT-PAK and First Response) using specimens from Kitgum; and for the First Response test using specimens from Douala (97.7%) and Baraka (96.8%). The First Response was the only RDT with an overall (total) adjusted sensitivity <99% (Table 2). Unadjusted (unweighted/crude) sensitivities are shown in Additional File 1.
Adjusted specificities across the six sites varied from 77.0% for First Response on specimens from Kitgum to 100% for STAT-PAK on specimens from Conakry and Kitgum (Table 2). The INSTI and the First Response test had the lowest overall adjusted specificities (<90%), while STAT-PAK was the only RDT with an adjusted total specificity >98% (Table 2).

Band intensity and inter-reader agreement
The proportion of weak bands (intensity = 1) read by each of the readers is shown in Table 4. Weak bands were seen only with SD Bioline and First Response (Tables 3 and 4). Very good inter-reader agreement was found for all HIV RDTs, with kappa coefficients ranging from 0.98 to 1.00 (Table 5). The Vikia and STAT-PAK tests showed no disagreement between readers. Agreement for the simple confirmatory tests was lower than for the RDTs (Table 5).

Diagnostic accuracy of the simple HIV confirmatory assays
The total adjusted sensitivity of both simple confirmatory assays was close to 100% (Table 6). The specificity of the ImmunoComb increased from 98.9% with the manufacturer's recommended interpretation to 99.4% when using the alternative interpretation criteria [18], while the rate of indeterminate results increased from 8.9% to 9.8%.
The specificity of the Geenius assay was 97.6% with visual reading and 98.3% with automated reading, with similar rates of indeterminate results for visual reading (9.2%) and automated reading (9.4%). Overall, measurement with the automated reader was as accurate as, or more accurate than, reading with the naked eye (Table 6).
Similar to results for the RDTs, specificities of both simple confirmatory assays varied across sites, with the lowest specificities recorded on specimens from Baraka (Table 6). Unadjusted (unweighted/crude) performance data are displayed in Additional File 2.

False reactive results and their associated risk factors
A total of 438 specimens gave false-positive results with at least one RDT. False-positive results were associated with different factors for each of the tests, as shown by the odds ratios for false-positive results in a multivariate analysis (Table 7). For Determine, the main determinant of a false-positive result was referral for testing by a clinician from the IPD, OPD or TB clinic (possibly reflecting the presence of comorbidities), whereas with Genie Fast and Vikia, a false positive was most strongly associated with being male. Differences by origin remained significant only for INSTI, SD Bioline, and First Response. More detailed analyses per test are provided in Additional File 3.

Discussion
Growing awareness of problems with patient misdiagnosis at some HIV testing sites in sub-Saharan Africa, and inconsistent findings on the accuracy of widely used simple diagnostic tests, have highlighted the urgent need for a comprehensive, systematic evaluation of these tests, with special emphasis on variation in their performance by geographical location and other characteristics [5]. All but one of the RDTs evaluated here have been WHO prequalified, and of them, only STAT-PAK recorded a final sensitivity of less than 100% (99.5%) [6,7]. The final specificities in the WHO prequalification evaluations were: 100% for STAT-PAK, 99.9% for SD Bioline and Vikia, 99.4% for First Response and 98.9% for Determine [6,7]. However, in our evaluation, individual RDTs performed more poorly than in WHO evaluations, with only one test (STAT-PAK) meeting the recommended thresholds for RDTs of ≥99% sensitivity and ≥98% specificity when using total estimates [1]. None of the tests met the WHO-recommended thresholds for sensitivity and specificity when using the lower end of the 95% CI [1]. While all but one HIV RDT and both simple confirmatory assays had total adjusted sensitivities ≥99%, the biggest problem identified was specificity, which varied widely among the different tests and by samples' origin. Only one of the eight tests (STAT-PAK) had a total adjusted specificity ≥98%, exceeding the WHO-recommended threshold (lower end of the 95% CI of ≥98%) [1] at five of six sites; two other tests (SD Bioline and First Response) exceeded it at one site. Although confirmatory assays are presumed to have higher specificity than RDTs, the two simple confirmatory assays evaluated here showed a specificity ≥98% at only half the study sites. Neither of the confirmatory assays met the WHO threshold of a lower 95% CI bound of ≥99% [1].
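The difference between judging a test on its point estimate and on the lower end of its 95% CI can be illustrated with a short sketch. The interval method used in the study is not stated, so the Wilson score interval used here is an assumption, and it ignores the survey weighting. The counts are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical site: 490 of 500 reference-negative samples non-reactive.
    lower, upper = wilson_interval(490, 500)
    print(f"specificity=0.980  95% CI=({lower:.3f}, {upper:.3f})")
```

In this hypothetical case the point estimate of 98.0% meets a ≥98% threshold, but the lower end of the interval (about 96.4%) does not, which is why the criterion based on the lower confidence bound is much harder to satisfy with samples of a few hundred.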
It has been proposed that cross reactivity, either direct or indirect, may be responsible for the variable performance of RDTs in different populations and test sites, and that concomitant disease, such as kala azar, sleeping sickness and schistosomiasis, could play a role [23][24][25][26]. Polyclonal B cell activation in response to various infections could account for the heterogeneity in test performance across different populations [27]. In our study, co-morbidities were assessed only by self-reporting, and no significant association with false reactive results could be established.
Interestingly, being referred by a clinician from the IPD, OPD or TB clinic (and therefore presumably having one or more co-morbidities) was a risk factor for false reactivity, but only for Determine. In contrast, for Genie Fast and Vikia, the main risk factor associated with false reactive results was male gender, with a 2- to 3-fold increased risk. Finally, the origin of the participants was highly associated with false reactivity on the INSTI, SD Bioline and First Response tests, indicating the presence of unknown site-specific factors.
It has been postulated that weak reactive test lines/dots are more likely to be false positive than true positive results and that considering them as potentially negative might reduce false-positive results [2,10,15,18,19,28,29]. We detected weak testing lines only with SD Bioline and First Response, the latter showing weak results on almost 50% of reactive tests for HIV-2. For other tests, however, no weak lines were reported, meaning that even false reactive/positive results produced a line of at least medium intensity. This presumably helped reduce variability between test readers: inter-reader agreement was very high (kappa coefficients ≥0.98) for all tests, in line with WHO recommendations of an inter-reader variability <5% [1].
Specificity for HIV-2 for the SD Bioline and First Response tests was low: 89.8% and 96.1%, respectively. This confirms results of the WHO prequalification evaluations, which found that RDTs showed a wide range of cross-reactivity (3-57%) on the HIV-2 line, potentially leading to significant false diagnosis of HIV-2 infections. However, as the RDTs concerned are WHO prequalified, providers and patients may be led to believe that the patient is dually infected or solely infected with HIV-2, a less aggressive form of the virus [7].
Several possible limitations related to the use of RDTs in this study should be noted. First, RDTs are designed for use on fresh specimens; in practice this typically means capillary whole blood. This study, however, used plasma samples that had been frozen, shipped, and stored before testing. Some studies have shown differences in sensitivity and specificity when using plasma/serum compared to capillary whole blood [13,28,30]. Second, our evaluation was carried out on one batch of index tests, precluding a comparison between batches. Third, considering the relatively low prevalence of HIV in some testing sites, we decided not to include all consecutive clients but, instead, all consecutive positives and a fixed number of negative clients. In doing so, we introduced verification bias, resulting in a sample that was not representative of the overall population. We therefore performed a weighted analysis to account for the sampling strategy, and acknowledge that these estimates are less robust than they would have been with consecutive sampling. Finally, the simple confirmatory assays need to be evaluated as part of an algorithm in addition to their individual performance.
Table footnote: N/A = not applicable. STAT-PAK was excluded because too few false-positive results were obtained.

Conclusions
In summary, the findings of this large multi-centre study indicate that HIV RDT performance can vary greatly according to patients' gender, comorbidities, and other unknown factors associated with geographic location, even within a single country. By performing all tests in a centralized setting, we show that these differences in performance cannot be attributed to study procedure, end-user variation or storage conditions. Also, simple confirmatory assays in this study had imperfect specificities that varied according to the origin of specimens, suggesting that they may not provide an appropriate universal solution in all geographical locations to the problem of false-positive results. Finally, these results underscore the need for local validation of HIV RDTs in order to design accurate testing algorithms.