HIV misdiagnosis in sub-Saharan Africa: performance of diagnostic algorithms at six testing sites

Abstract Introduction: We evaluated the diagnostic accuracy of HIV testing algorithms at six programmes in five sub-Saharan African countries. Methods: In this prospective multisite diagnostic evaluation study (Conakry, Guinea; Kitgum, Uganda; Arua, Uganda; Homa Bay, Kenya; Douala, Cameroon; and Baraka, Democratic Republic of Congo), samples from clients (five years of age or older) testing for HIV were collected, and the on-site results were compared to those of a state-of-the-art algorithm from the AIDS reference laboratory at the Institute of Tropical Medicine, Belgium. The reference algorithm consisted of an enzyme-linked immuno-sorbent assay, a line-immunoassay, a single antigen-enzyme immunoassay and a DNA polymerase chain reaction test. Results: Between August 2011 and January 2015, over 14,000 clients were tested for HIV at 6 HIV counselling and testing sites. Of those, 2786 (median age: 30; 38.1% males) were included in the study. Sensitivity of the testing algorithms ranged from 89.5% in Arua to 100% in Douala and Conakry, while specificity ranged from 98.3% in Douala to 100% in Conakry. Overall, 24 (0.9%) clients, and as many as 8 per site (1.7%), were misdiagnosed, with 16 false-positive and 8 false-negative results. Six false-negative specimens were retested with the on-site algorithm on the same sample and were found to be positive. Conversely, 13 false-positive specimens were retested: 8 remained false-positive with the on-site algorithm. Conclusions: The performance of algorithms at several sites failed to meet expectations and thresholds set by the World Health Organization, with unacceptably high rates of false results. Alongside the careful selection of rapid diagnostic tests and the validation of algorithms, strictly observing correct procedures can reduce the risk of false results. In the meantime, to identify false-positive diagnoses at initial testing, patients should be retested upon initiating antiretroviral therapy.


Introduction
HIV testing algorithms based on rapid diagnostic tests (RDTs) are widely used in HIV testing and counselling (HTC) programmes in areas with limited laboratory infrastructure [1]. RDTs for HIV are low cost, require no cold chain for storage and only minimal training to operate, and provide same-day results [2,3]. Algorithms using RDTs are thus ideal for use in resource-constrained settings lacking the laboratory infrastructure and human and financial resources to support the use of more complex techniques, such as enzyme-linked immuno-sorbent assay (ELISA) or immunoblots. To diagnose HIV in these contexts, the World Health Organization (WHO) recommends the sequential use of two or three RDTs in high- and low-prevalence HIV settings, respectively [1]. Unfortunately, these recommendations have yet to be widely implemented: many countries still use serial tiebreaker algorithms wherein two out of three positive RDTs constitute an HIV-positive diagnosis [4]. Furthermore, the WHO recommends using serological assays/RDTs with a sensitivity of at least 99%; the first RDT should have a specificity of at least 98%, while the second and third RDTs should have a specificity of at least 99%. Despite the good performance of numerous individual RDTs in recent WHO evaluations [2,3], false-positive results have been reported from projects operated by Médecins sans Frontières (MSF) [5][6][7], a humanitarian emergency organization, and by other actors [7][8][9][10][11][12][13][14][15][16][17]. A false-positive result is likely to be psychologically traumatic to the patient and may trigger inappropriate, potentially harmful treatment [6]. Additionally, reporting false-positive results, even if due to a test's technical limitations, can undermine patient confidence in the HTC centre [6].
We conducted a standardized multicentre study in six sites in sub-Saharan Africa to evaluate the performance of HIV testing algorithms routinely used across the region in real-life conditions. The objectives of this study were to evaluate the performance of the algorithms used in each site and performed in routine conditions; to evaluate the accuracy of the most commonly used RDTs under controlled laboratory conditions and deduce their performance when combined in algorithms following WHO recommendations; and to compare performance under real and ideal conditions with WHO-recommended thresholds. Here, we describe the accuracy of the algorithms performed in these sites in routine conditions as compared to a state-of-the-art algorithm of the AIDS reference laboratory at the Institute of Tropical Medicine (ITM), Antwerp, Belgium.

Study settings
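This multicentre study took place within six HTC sites in sub-Saharan Africa: Conakry, Guinea; Kitgum, Uganda; Arua, Uganda; Homa Bay, Kenya; Douala, Cameroon; and Baraka, Democratic Republic of Congo.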

Study population and inclusion
The study population included all clients five years of age or older attending one of the study HTC sites for HIV testing who provided written informed consent to participate in the study. All participants were invited to sign a separate form for tracing in case of a misdiagnosis.
After enrolment, clients were counselled and tested for HIV according to site-specific procedures and testing algorithms. In addition, blood samples were drawn and EDTA plasma prepared and stored at −20°C for transfer to the reference laboratory.

Sample size and sampling strategy
We calculated a sample size of at least 200 algorithm HIV-positive and 200 algorithm HIV-negative samples from clients at each study site, based on the assumption that sensitivity and specificity were both 98%, which provides a 95% confidence interval within ±2% for each.
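The precision target above can be checked with a short calculation. The sketch below assumes a normal-approximation (Wald) interval, which the text does not name; the function name is illustrative only.

```python
import math

def wald_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation (Wald) confidence interval for a proportion.

    Assumption: the study text does not state which interval formula was used;
    the Wald interval is used here purely for illustration.
    """
    return z * math.sqrt(p * (1.0 - p) / n)

# Expected sensitivity (or specificity) of 98% estimated from 200 samples.
half_width = wald_half_width(p=0.98, n=200)
print(f"95% CI half-width: +/-{half_width * 100:.2f} percentage points")
```

Under these assumptions the half-width comes out just under 2 percentage points, consistent with the stated target.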
If the prevalence of positive results at the HTC site was between 40% and 60%, we collected all consecutive samples and calculated the total sample size based on prevalence in order to obtain at least 200 HIV-positive and 200 HIV-negative samples (i.e. the higher of 200/p or 200/(1 − p), where p is the prevalence, with a maximum sample size of 500). This sample size was then increased by 10% to account for losses, problems in shipment, or sample integrity.
If the prevalence was below 40%, we obtained a subset of HIV-positive and HIV-negative samples (according to the algorithm in place). Since the HIV testing algorithms were expected to be relatively accurate, we anticipated very few misdiagnoses. Conservatively, assuming 10% misdiagnosis, we collected a subsample of 220 HIV-positive and 220 HIV-negative samples according to the algorithm. This would ensure that we had at least 200 true-positive and 200 true-negative samples. All samples with inconclusive algorithm results (i.e. discordant results between the two RDTs in sites not using a tiebreaker) were also collected, along with a backup sample from each participant in case of shipment problems or for potential retesting on site. Every misdiagnosed participant who had consented to tracing was subsequently traced; if the participant had consented, a new sample was collected and tested to exclude the possibility of clerical and other errors.
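The consecutive-sampling rule can be written out as a small function. This is a sketch under two assumptions not spelled out in the text: fractional sample sizes are rounded up, and the 500 cap is applied before the 10% margin is added. The function and parameter names are hypothetical.

```python
import math

def total_sample_size(prevalence: float,
                      per_group: int = 200,
                      cap: int = 500,
                      loss_percent: int = 10) -> int:
    """Total consecutive samples needed to reach `per_group` positives and negatives.

    Assumptions (not stated in the source): round up to whole samples,
    and apply the cap before adding the loss margin.
    """
    needed = max(per_group / prevalence, per_group / (1.0 - prevalence))
    capped = min(math.ceil(needed), cap)
    # Integer ceiling of capped * (1 + loss_percent / 100), avoiding float rounding.
    return -(-capped * (100 + loss_percent) // 100)

print(total_sample_size(0.50))  # 400 samples before margin, 440 after
print(total_sample_size(0.40))  # reaches the 500 cap, 550 after margin
```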

Testing strategies and algorithms at study sites
Testing strategies (serial versus parallel testing, with and without confirmatory testing) varied among the six study sites (Table 1 and Figure 1). The sample types included capillary whole blood and EDTA plasma. All study sites used Determine HIV-1/2 (Alere, USA) as the first test in the algorithm while at two sites, Baraka and Kitgum, parallel testing with Determine HIV-1/2 and another RDT was performed. The second and third tests used were ImmunoFlow HIV 1-HIV 2 (Core Diagnostics, UK), Uni-Gold HIV (Trinity Biotech, Ireland), HIV 1/2 Stat-Pak (Chembio, USA), ImmunoComb II HIV 1&2 BiSpot (Orgenics, Israel) or GS HIV-1/HIV-2 PLUS O EIA (Bio-Rad, USA), as detailed in Table 1. All tests were performed and interpreted by the staff members who routinely perform testing in the programme, including laboratory technicians and/or counsellors trained on the use of the test. No specific training on test procedures was provided as part of the study so that the results would be representative of routine testing methods.
In addition to the on-site testing algorithm, MSF used an alternative algorithm in Kitgum and Baraka that used the ImmunoComb II HIV 1&2 CombFirm (Orgenics, Israel) as a simple confirmatory test following two reactive RDTs. The ImmunoComb is an indirect solid-phase enzyme immunoassay (EIA) containing markers for p24 (gag), p31 (pol) and three env-derived protein spots: gp41, gp120 and gp36. The ImmunoComb was interpreted not as the manufacturer instructs but in accordance with strict criteria proposed in an earlier evaluation [7]. In summary, a reaction of 3-4 spots was interpreted as a positive result, a reaction of 1-2 spots as an indeterminate result and no reaction as a negative result. Gp36 was not considered for the alternative interpretation.
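The alternative interpretation rule for the confirmatory assay can be sketched as follows. The spot names come from the description above; the function name and the treatment of a gp36-only reaction as negative are assumptions made for illustration.

```python
# Spots counted under the alternative interpretation; gp36 is deliberately excluded.
COUNTED_SPOTS = ("p24", "p31", "gp41", "gp120")

def interpret_combfirm(reactive_spots: set) -> str:
    """Classify a confirmatory result from the set of reactive spot names.

    Assumption: a reaction on gp36 alone counts as no reaction, since gp36
    is not considered in the alternative interpretation.
    """
    count = sum(1 for spot in COUNTED_SPOTS if spot in reactive_spots)
    if count >= 3:
        return "positive"
    if count >= 1:
        return "indeterminate"
    return "negative"

print(interpret_combfirm({"p24", "gp41", "gp120"}))  # positive
print(interpret_combfirm({"p24", "gp36"}))           # indeterminate
print(interpret_combfirm(set()))                     # negative
```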
The INNO-LIA HIV I/II Score was used to confirm the presence of antibodies against HIV type 1 (HIV-1), including group O viruses, and type 2 (HIV-2). The INNO-LIA HIV I/II Score detects antibodies against gp120, gp41, p31, p24, p17, gp105 and gp36. If the INNO-LIA HIV I/II Score was negative or indeterminate, the samples were tested with an antigen-enzyme immunoassay (Ag-EIA, i.e. INNOTEST HIV Antigen mAb, Innogenetics NV, Ghent, Belgium) in order to exclude acute infections [2].
If both the LIA and Ag-EIA were negative, the sample was classified as HIV negative. If the LIA was indeterminate and the Ag-EIA negative, the final result was indeterminate. If the LIA was negative or indeterminate and the Ag-EIA was positive (confirmed by neutralization), it was considered a potential seroconversion or acute infection. If the LIA confirmation could not differentiate between HIV-1 and HIV-2 and the outcome was a simple HIV infection, the specimen was tested with an in-house HIV DNA-PCR for HIV-1 and HIV-2 on the DPS. If the DNA-PCR was positive for HIV-1, HIV-2 or for both HIV-1 and HIV-2, the sample was classified as positive for HIV-1, for HIV-2 or for both viruses, respectively.
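The decision rules of the reference algorithm described above can be summarized in code. This is a minimal sketch: the result labels, the function name, the direct classification of a typed LIA result, and the handling of an undifferentiated LIA result without a positive DNA-PCR are assumptions, since the text does not spell out those branches.

```python
from typing import FrozenSet, Optional

def classify_reference(lia: str,
                       ag_eia: Optional[str] = None,
                       pcr_types: Optional[FrozenSet[str]] = None) -> str:
    """Final reference classification from LIA, Ag-EIA and DNA-PCR results.

    lia: "negative", "indeterminate", "HIV-1", "HIV-2" or "undifferentiated"
    ag_eia: "positive" or "negative" (needed when LIA is negative/indeterminate)
    pcr_types: HIV types detected by DNA-PCR (used when LIA cannot differentiate)
    """
    if lia in ("HIV-1", "HIV-2"):
        # Assumption: a typed LIA result needs no further testing.
        return f"{lia} positive"
    if lia == "undifferentiated":
        detected = pcr_types or frozenset()
        if detected == {"HIV-1", "HIV-2"}:
            return "HIV-1 and HIV-2 positive"
        if detected == {"HIV-1"}:
            return "HIV-1 positive"
        if detected == {"HIV-2"}:
            return "HIV-2 positive"
        # Assumption: without a positive PCR the sample stays undifferentiated.
        return "HIV positive (undifferentiated)"
    if lia in ("negative", "indeterminate"):
        if ag_eia == "positive":
            return "potential seroconversion or acute infection"
        if ag_eia == "negative":
            return "HIV negative" if lia == "negative" else "indeterminate"
        raise ValueError("Ag-EIA result required when LIA is negative or indeterminate")
    raise ValueError(f"unknown LIA result: {lia!r}")

print(classify_reference("indeterminate", ag_eia="negative"))
print(classify_reference("undifferentiated", pcr_types=frozenset({"HIV-1"})))
```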

Data management and statistical analysis
EpiData 3.1 software (EpiData, Odense, Denmark) was used to perform data entry at all study sites. At ITM, data were collected in an Excel file. The accuracy of data entry at both ITM and the study sites was monitored by a data clerk who double checked all entries. STATA version 13.1 (StataCorp, College Station, TX, USA) was used to perform statistical analysis.
Results of the testing algorithm at each site were compared to those of the reference standard algorithm to calculate sensitivity, specificity, and predictive values. Participants with an inconclusive result either with the on-site algorithm or with the reference algorithm, as well as those diagnosed as having an acute infection with the reference algorithm, were excluded from the performance analyses. At sites where the sampling strategy introduced verification bias (i.e. Douala, Arua, Kitgum; see sampling strategy), correction was carried out using a Bayesian method proposed by Zhou [21].
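For reference, the four accuracy measures follow directly from a 2×2 table of on-site versus reference results, once inconclusive results and acute infections have been excluded. The sketch below uses hypothetical counts, not study data, and does not include the verification-bias correction.

```python
def accuracy_measures(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Sensitivity, specificity, PPV and NPV from a 2x2 table.

    tp/fp/fn/tn are counts of on-site results judged against the reference
    standard, with inconclusive results and acute infections already excluded.
    """
    return {
        "sensitivity": tp / (tp + fn),  # reference-positives detected on site
        "specificity": tn / (tn + fp),  # reference-negatives cleared on site
        "ppv": tp / (tp + fp),          # on-site positives that are truly positive
        "npv": tn / (tn + fn),          # on-site negatives that are truly negative
    }

# Hypothetical counts for illustration only.
measures = accuracy_measures(tp=198, fp=2, fn=1, tn=199)
for name, value in measures.items():
    print(f"{name}: {value:.1%}")
```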

Ethics
The study was approved by the MSF Ethics Review Board and the Ethics Committee of the five countries where the study took place. Separate written informed consent was obtained for participation in the study and tracing in case of misdiagnosis.

Results
Between August 2011 and January 2015, 14,015 clients were tested for HIV at six HTC sites, and 2786 (19.9%) were included in the study (Table 2). The median age was 30 years (IQR: 22-42) and the proportion of males was 38.1% (IQR: 29.6-48.2%). Most study participants who utilized the HTC service were self-referred (58.3%) or were referred by a partner (18.6%).
Other testing was provider initiated by the inpatient department (11.9%), outpatient department (1.7%), antenatal care (6.4%) or the tuberculosis clinic (3.0%) within the same health facility (Table 2). Across all testing sites, the HIV positivity rate ranged from 8.0% in Baraka to 63.7% in Conakry (Table 2). Of all 2786 specimens tested at the ITM, 1281 were classified as HIV-1 positive, 1 as HIV-2 positive, 25 as HIV-positive (undifferentiated), 2 as acute infections, 3 as inconclusive and 1474 as negative.
After adjustment for the under-representation of negative results in the study design and exclusion of inconclusive results and acute infections, the sensitivity of the testing algorithms ranged from 89.5% in Arua to 100% in Douala and Conakry (Table 3). The specificity ranged from 98.3% in Douala to 100% in Conakry. The positive predictive value (PPV) ranged from 96.4% in Douala to 100% in Conakry. The negative predictive value (NPV) ranged from 98.3% in Arua to 100% in Conakry, Douala and Baraka.
Overall, 24 (0.9%) study participants were misdiagnosed, ranging from 0 to 8 participants per site (0-1.7%), with 16 false-positive and 8 false-negative results. Six false-negative specimens were retested using a backup sample and were found to be HIV positive (Table 4). Thirteen of the 16 false-positive specimens were similarly retested with the backup sample: 10 remained positive (reactive with the first two RDTs), while 2 remained positive with only one RDT. In addition, in Douala, all eight clients with a false-positive result were traced and a new sample was drawn to exclude clerical error as the cause of misdiagnosis. One client had a negative result, one had an indeterminate result and the remaining six of eight clients maintained a reactive result with both RDTs. All specimens were again found to be negative by the reference algorithm (Table 4).
Detailed testing results of 99 participants with discordant results between the first two RDTs are described in Table 5. The number of discordant results by site varied from four in Conakry to 54 in Baraka: this high number is explained by the fact that all discordant results, considered inconclusive, were included in the study, and by the long duration of the recruitment period. It should be noted that the proportion of inconclusive results among all clients tested in this site during the study period was 1.9%, which was only marginally higher than the proportion of inconclusive results in other sites where discordant results were classified as inconclusive (Table 2). Of these 99 discordant results, the majority (n = 91) were negative by the reference standard (Table 4).
Of the two clients in Kitgum classified as having acute infections, one was diagnosed as positive by the national algorithm on site (indeterminate by the MSF algorithm), and the other as negative.

Discussion
Reports of unacceptably high error rates in RDT-based HIV testing in some resource-constrained settings [5][6][7][8][9][10][11][12][13][14][15][16][17] led us to conduct a large, multisite study to assess the performance of testing algorithms used at six sites in sub-Saharan Africa. The results of the HIV testing algorithms routinely used in these sites were compared to those of an internationally recognized reference algorithm from the AIDS reference laboratory for HIV at ITM, Antwerp. Here, we show that the performance of testing algorithms failed to meet the WHO-recommended thresholds of a PPV of ≥99% [1] at Kitgum, Arua, Douala and Baraka. However, at Kitgum and Baraka, the 99% threshold could be exceeded when using an alternative algorithm including a simple confirmatory assay.
While the sensitivity and NPVs of testing algorithms were excellent (100%) at three of the six study sites (Conakry, Douala and Baraka), results at other sites showed lower sensitivities and NPVs, particularly after adjustment for the under-representation of algorithm-negative specimens. Indeed, in sites such as Arua, where less than 10% of those screened negative on site were included in the study, finding one false-negative study participant might mean that up to 10 false-negative clients would have been found if all had been tested using the reference standard. However, all false-negative results were later found to be positive when samples were tested a second time using the same on-site algorithm, except in Kitgum, where backup samples were not available at the end of the study. This finding suggests that most false-negative results could have been due to an improper procedure or misinterpretation of the test in Arua. Alternatively, this difference could be attributed to the specimen used; while initial testing was performed on capillary whole blood at these sites, the backup sample was plasma. Although manufacturers are requested to show equivalence between recommended sample types, RDTs on serum/plasma have been reported to have higher sensitivity and lower specificity compared to RDTs on capillary whole blood [22,23].
The specificities and PPVs of the on-site algorithms were also suboptimal (i.e. PPV <99%) in four sites using national or local algorithms (Kitgum, Arua, Douala and Baraka). We identified clients who had been misdiagnosed as false-positives at all but one study site. Most of the false-positive results were confirmed upon retesting of the backup sample. In Douala, six of eight clients with an initial false-positive result remained false-positive with the RDT algorithm using a fresh sample but were found to be negative by the reference test, indicating an intrinsic problem with the RDT algorithm.
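The reasoning that one observed false negative may stand for up to ten can be illustrated with a simple weighting by the inverse of the sampling fraction. This is not the Bayesian method of Zhou [21] used in the analysis; it is a simpler substitute shown only to make the arithmetic visible, and the counts are hypothetical.

```python
def weighted_sensitivity(tp: int, fn: int,
                         frac_pos_verified: float,
                         frac_neg_verified: float) -> float:
    """Sensitivity after weighting each verified sample by 1 / sampling fraction.

    Note: a plain inverse-probability weighting, named here as a stand-in;
    the study itself applied a Bayesian correction.
    """
    tp_weighted = tp / frac_pos_verified  # true positives come from on-site positives
    fn_weighted = fn / frac_neg_verified  # false negatives come from on-site negatives
    return tp_weighted / (tp_weighted + fn_weighted)

# Hypothetical site: all on-site positives verified, 10% of on-site negatives verified.
naive = 200 / (200 + 1)
adjusted = weighted_sensitivity(tp=200, fn=1,
                                frac_pos_verified=1.0,
                                frac_neg_verified=0.10)
print(f"naive sensitivity:    {naive:.1%}")
print(f"adjusted sensitivity: {adjusted:.1%}")
```

With these hypothetical numbers the single false negative is counted as ten, which lowers the estimated sensitivity by several percentage points.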
Several reasons could be proposed to explain the suboptimal PPV of these RDT algorithms, including low specificity of individual RDTs. First of all, it should be noted that none of the algorithms used in the study sites followed the current WHO guidelines for HIV testing [1]. Although the use of a tiebreaker is not recommended by the WHO, it is common in the WHO African region [4] and a tiebreaker was used in three of our study sites. The use of a tiebreaker has been associated with a higher risk of false-positive results [1,9,24]. However, in this study, only one of 16 false-positive results was due to the use of a tiebreaker, whereas all others had reactive results with the first two assays. This suggests that ... [25]. However, data from different studies suggest that their performance in real-life conditions may vary [5][7][8][9][10][11][12][13][14][15][16][17].
In response to previous reports of false-positive results, MSF proposed an alternative algorithm containing a simple confirmatory assay to reduce the number of false-positive results [7]. Using this assay in Baraka increased the specificity and PPV of the alternative algorithm, as compared to the local algorithm. However, the use of a simple confirmatory assay, as opposed to an RDT-only algorithm, still needs to be evaluated and balanced with cost and ease of use. The use of a third RDT to confirm a positive result, as proposed in the latest WHO recommendations for low-prevalence settings, might also increase the PPV of the algorithm [1].
As is common practice in diagnostic test evaluations, inconclusive results were excluded from the analysis of sensitivity and specificity. However, given their significant psychological and pragmatic impact, an overall performance evaluation of the algorithm should take these into consideration. Here, the proportion of inconclusive results among study participants should be interpreted with caution since it was not representative of the overall proportion due to our sampling strategy. It is also important to note that inconclusive results were only reported in sites where a tiebreaker was not used. In these sites, the overall proportion of inconclusive results varied from 0.2% to 1.9%. Whether this reflects geographical variability or is caused by immunological factors, as reported previously [26,27], or differences in testing algorithms should be investigated further.
The current WHO guidelines for HIV testing suggest performing a third RDT following discordant results with the first two RDTs, and, if this third RDT is non-reactive, recording the final result as negative. Importantly, this third RDT is not used as a tiebreaker since a reactive result with this third RDT would lead to an inconclusive result. Still, this approach would likely yield a smaller proportion of inconclusive results, most of which, as we have shown, were found to be negative by the reference standard.
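The difference between a tiebreaker and the approach described above lies in how a reactive third RDT is read after discordant results. The sketch below contrasts the two rules for a serial algorithm; the function names are hypothetical, and treating two reactive RDTs as a positive result assumes the two-test strategy for high-prevalence settings.

```python
from typing import Optional

def tiebreaker_result(rdt1: bool, rdt2: Optional[bool] = None,
                      rdt3: Optional[bool] = None) -> str:
    """Serial tiebreaker: two reactive results out of three give a positive."""
    if not rdt1:
        return "negative"
    if rdt2:
        return "positive"
    # Discordant first two tests: the third test decides.
    return "positive" if rdt3 else "negative"

def third_rdt_result(rdt1: bool, rdt2: Optional[bool] = None,
                     rdt3: Optional[bool] = None) -> str:
    """Third RDT after discordance, not used as a tiebreaker."""
    if not rdt1:
        return "negative"
    if rdt2:
        # Assumption: two reactive RDTs suffice (high-prevalence strategy).
        return "positive"
    # Discordant first two tests: a reactive third test gives an inconclusive result.
    return "inconclusive" if rdt3 else "negative"

# Same test pattern, different final results.
print(tiebreaker_result(True, False, True))  # positive
print(third_rdt_result(True, False, True))   # inconclusive
```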
There were several limitations to this study. First, due to the low prevalence of HIV in certain sites, we selected study participants based on their on-site test results, which introduced a verification bias. We attempted to correct for this bias using a method proposed by Zhou [21], but this led to wide confidence intervals for some estimates. Additionally, while several sites performed on-site testing on capillary whole blood, retesting of backup samples and reference testing were always performed on plasma. Though this allows for a performance assessment of the algorithms in real-life conditions, it restricts investigation of false results. The lack of systematic testing of backup samples in certain sites also made it difficult to discern the cause of discrepancies between on-site and reference results.

Conclusions
This large multicentre study of the performance of HIV testing algorithms in sub-Saharan Africa highlights the inconsistent performance of HIV testing algorithms. While suboptimal sensitivities of testing algorithms could be the product of procedural mistakes, an inadequate RDT algorithm in Douala was responsible for suboptimal specificity and PPV. Alongside attention to quality issues, such as respecting incubation time, correct labelling, batch control and careful selection of HIV RDTs, validation of the algorithms in use should be conducted regularly in order to minimize the risk of misdiagnosis. National authorities should also ensure that their policy aligns with the most current WHO recommendations, in terms of both algorithm design and implementation of other strategies to mitigate against misdiagnosis, such as retesting at the start of antiretroviral therapy.

Funding
MSF's Innovation Fund provided funding for sample collection at the study sites, shipment to the central laboratory and analysis at the ITM central laboratory. The study sponsor had no role in the study design, data collection, analysis and interpretation of the data or in the decision to submit for publication. The corresponding author had access to all data and final responsibility for the decision to submit for publication.