Prevalence of Clinically Relevant UGT1A Alleles and Haplotypes in African Populations


Corresponding author: Dallas M. Swallow, Department of Genetics, Environment and Evolution, University College London, Wolfson House, London NW1 2HE, UK. Tel: 0207-679-5040; Fax: 0207-387-3496; E-mail:


Variation of a short (TA)n repeat sequence (rs8175347) covering the TATA box of UGT1A1 (UDP-glucuronosyltransferase 1A1) is associated with hyperbilirubinaemia (Gilbert's syndrome) and adverse drug reactions, and is used for dosage advice for irinotecan. Several reports indicate that the low-activity (risk) alleles ((TA)7 and (TA)8) are very frequent in Africans, but the patterns of association with other variants in the UGT1A gene complex that may modulate these responses are not well known. rs8175347 and two other clinically relevant UGT1A variants (rs11692021 and rs10929302) were assayed in 2616 people from Europe and Africa. Low-activity (TA)n allele frequencies were highest in equatorial Africa, (TA)7 being the most common allele in Cameroon, Ghana, southern Sudan, and in Ethiopian Anuak. Haplotypic diversity was also greatest in equatorial Africa, but in Ethiopia was very variable across ethnic groups. Resequencing of the promoter in a subset of samples revealed no novel variation, but rs34547608 and rs887829 were typed and shown to be tightly associated with (TA)n. Our results illustrate the need for investigation of the effect of UGT1A variants other than (TA)n on the risk of irinotecan toxicity, as well as hyperbilirubinaemia due to hemolytic anaemia or human immunodeficiency virus protease inhibitors, so that appropriate pharmacogenetic advice can be given.


UDP-glucuronosyltransferase 1A isoform 1 (UGT1A1) is a phase II drug metabolizing enzyme responsible for converting a wide array of drugs to water-soluble glucuronides suitable for renal or biliary elimination (MIM*191740). UGT1A1 is also the main isozyme capable of conjugating bilirubin, the endogenous yellow pigment resulting from natural haeme catabolism (Bosma et al., 1994).

The inherited hyperbilirubinaemia known as Gilbert's syndrome (MIM*143500), for which intermittent episodes of jaundice are the most widely recognised symptom, is attributable to reduced UGT1A1 activity (Bosma et al., 1995). Gilbert's syndrome has been studied mainly in European and East Asian populations where its prevalence is estimated at 3–9% (Kornberg, 1942; Bosma et al., 1995; Owens & Evans, 1975; Gwee et al., 1992; Buyukasik et al., 2008). The underlying genetic cause in most populations is considered to be homozygosity for seven thymine–adenine repeats (TA)7 (UGT1A1*28, rs8175347) in the TATA box promoter motif (Bosma et al., 1995; Borlak et al., 2000), and mean bilirubin levels of (TA)7 homozygotes are approximately double those of (TA)6 homozygotes (Lampe et al., 1999; Premawardhena et al., 2003; Lin et al., 2006). Although bilirubin is neurotoxic at very high levels, particularly in children, it is also a potent antioxidant, and moderately elevated levels have been proposed to protect against adult oxidative stress-mediated diseases (Stocker et al., 1987). Indeed, strong negative associations have been observed between bilirubin level and incidence of cancer and cardiovascular disease (Novotny & Vitek, 2003; Zucker et al., 2004; Temme et al., 2001). Raised bilirubin levels can also inhibit replication in vitro of various blood pathogens including pneumococcus, the malaria parasite Plasmodium, and human immunodeficiency virus (HIV) (Najib, 1937; McPhee et al., 1996; Kumar et al., 2008).

The prevalence of (TA)7 homozygosity in European populations is 6–10% (Premawardhena et al., 2003). Even higher frequencies have been reported in sub-Saharan Africa (Premawardhena et al., 2003). Though rarely identified in other populations, two additional repeat alleles, (TA)5 and (TA)8, are also present at low frequency in people of recent African descent (Beutler et al., 1998; Premawardhena et al., 2003). There is a negative association between UGT1A1 expression and repeat length of the four alleles, attributable to decreasing promoter activity acting via altered affinity for the TATA-binding protein (Beutler et al., 1998; Hsieh et al., 2007). Although the (TA)n alleles appear to have similar effects on bilirubin levels in people of recent African descent (Chaar et al., 2005; Hong et al., 2007; Carpenter et al., 2008) and low-activity alleles confer significantly raised risk of developing gallstones requiring surgery (Passon et al., 2001; Heeney et al., 2003), Gilbert's syndrome is rarely diagnosed in Africa (Bougouma et al., 1999).

Homozygosity for (TA)7 has also been associated with adverse drug reactions (ADRs) due to reduced clearance, most notably life-threatening toxicity to chemotherapy with high-dose irinotecan (Hoskins et al., 2007). Data supporting this association led the Food and Drug Administration (FDA) in 2004 to alter the label to recommend a lower starting dose for patients with the (TA)7/(TA)7 genotype (New Drug Application 20-571). Severe hyperbilirubinaemia following treatment with the HIV protease inhibitors indinavir and atazanavir is also much more frequent in (TA)7 homozygotes due to the inhibitory effect of these drugs on UGT1A1 activity (Danoff et al., 2004; Zhang et al., 2005; Lankisch et al., 2006; Rodriguez-Novoa et al., 2007). However, it is likely that the pharmacogenetic effects of (TA)7 are confounded by additional common functional variants located in the UGT1A1 regulatory regions and in the other enzymes encoded by the UGT1A gene complex (Lankisch et al., 2006). This gene complex encodes the nine UGT1A isoforms and in Europeans and East Asians a region of strong linkage disequilibrium (LD) extends across much of the complex (about 90 kb) (Innocenti et al., 2005). Several other “low-activity” alleles reside on the same haplotype as (TA)7 in European populations (Innocenti et al., 2002; Kohle et al., 2003; Innocenti et al., 2005; Menard et al., 2009), and probably play a role in outcomes associated with irinotecan and HIV therapy (Lankisch et al., 2006; Lankisch et al., 2009). Lower levels of LD across the UGT1A gene complex have been reported in African–Americans and for the Yoruba of Nigeria (Innocenti et al., 2002; Odeberg et al., 2006; Hong et al., 2007), but studies for other indigenous African groups are lacking.

It is increasingly clear that the people of the African continent show higher levels of genetic diversity and population substructure than most human populations on other continents. The study of pharmacogenetically relevant variation in Africa is thus particularly important for identifying groups potentially at risk of poor drug response or ADRs. While cancer therapy with irinotecan must be comparatively rare on the African continent, this drug is used to treat people of recent African descent in the United States and Europe. In addition, the possible implications of UGT1A variation for the HIV treatments that are subsidised for use across Africa, and the negative interaction of low-activity UGT1A1 (TA)n promoter variants with inherited blood disorders common in parts of Africa, are of considerable clinical importance in Africa itself (Chaar et al., 2005; Kaplan et al., 2008).

Our first aim was therefore to establish the allele frequencies of (TA)n in different parts of the continent in relation to geography and ethnic origin. The second aim was to determine whether there are differences, across these defined populations, in the haplotype backgrounds of the (TA)7 allele with respect to other functional single-nucleotide polymorphisms (SNPs), which might indicate greater functional diversity both in Africa and in people of recent African descent. For this study, two SNPs were selected that are thought to play a role in irinotecan metabolism and toxicity (Innocenti et al., 2004; Cote et al., 2007). In order to determine whether there is any further variation in the immediate promoter that might modulate expression in some Africans, we also sequenced the region immediately upstream of the start of translation of UGT1A1 in a subset of the samples.

Materials and Methods


The 2616 buccal DNA samples analysed in this study are part of an in-house collection assembled by The Centre for Genetic Anthropology at University College London. All samples were collected from ostensibly healthy individuals unrelated at the paternal grandfather level and were anonymous, since names were not recorded. They were collected between 1998 and 2007 with informed consent and ethical approval (UCLH 99/0196). The samples tested were from 18 countries across six geographic regions defined as follows: North Europe (NE), the Middle East (ME), North Africa (NA), West Africa (WA), Central East Africa (CEA), and South East Africa (SEA) (Veeramah et al., 2008). Self-reported cultural identity/ethnicity and language details were also available for the majority of the panel. For the most detailed analyses, country subgroups of 40 or more individuals of the same self-declared cultural identity/language group were tested separately.


For this study, two SNPs were selected in addition to the (TA)n variant. The first, rs10929302 (−3156G > A, UGT1A1*93), lies in the phenobarbital response enhancer module (PBREM) approximately 3 kb upstream of the (TA)n variant and has been claimed to better predict irinotecan toxicity (Innocenti et al., 2004; Cote et al., 2007). The second, the nonsynonymous SNP rs11692021 (Trp208Arg, UGT1A7*3), lies in the substrate-binding exon of UGT1A7 (MIM*606432) approximately 90 kb upstream from (TA)n and reduces glucuronidation of SN-38, the active metabolite of irinotecan. All three loci are in strong LD in European populations (Kohle et al., 2003; Innocenti et al., 2002).

The (TA)n variant was assayed by a previously reported technique using high-percentage polyacrylamide gels (Sampietro et al., 1998). The selected SNPs were assayed using TaqMan technology (Applied Biosystems, Foster City, CA). TaqMan probes were designed by Applied Biosystems and polymerase chain reactions were performed in 384-well microplates using a gradient cycler. The TaqMan probes are reported in supplementary Table S1A. Fluorescence was measured using an ABI Prism 7000 sequence detection system (Applied Biosystems, Applera, Warrington, Cheshire, UK), and genotypes were assigned with 95% confidence using ABI Prism 7000 SDS software version 2.1. A batch of 368 samples from African and non-African groups was first tested to check that there was adequate allelic variation in the populations under study, and of these, 156 samples were replicated in the larger panel to validate typing. Call rates were >95% for rs10929302 and >92% for UGT1A7 rs11692021. In all instances, researchers were blind to the sample origin at the time of typing.

A region upstream (−380) and downstream (+60) of the ATG start site of UGT1A1 (∼−330 from the (TA)n sequence to ∼+100 of the (TA)n sequence) was resequenced in a subset of 372 African samples chosen to represent each geographic region (65 from Algeria [NA], 82 from Cameroon [WA], 148 from Ethiopia [CEA], and 77 from Malawi [SEA]; the largest number was taken from Ethiopia, which is the most diverse country) using an ABI 96-capillary 3730xl DNA Analyzer (Applied Biosystems, Applera, UK) (see Table S2 for sequencing primers). This allowed typing of rs34547608 (at −52 bp from (TA)n) and rs887829 (at −310 bp from (TA)n).

In all cases, genotypes were inferred assuming no silent alleles.

Data Analyses

All analyses were performed using Arlequin 3.1 unless otherwise specified (Excoffier et al., 2005).

Exact tests for deviation from Hardy–Weinberg equilibrium were performed (using 10,000 steps in a Markov chain; 10,000 dememorization steps). For display on the map in Figure 1, (TA)n genotypes were recoded into three “expression” phenotypes using groups assigned from bilirubin levels in a study on people with recent African ancestry (African-Caribbean) (Chaar et al., 2005). Comparisons of genetic distances between populations (regions, countries, and ethnic groups) based on (TA)n genotype frequencies were made by calculating pairwise FST values (10,000 permutations). Because of the large number of different ethnic groups with very few individuals, we limited the analysis of ethnic groups to those with at least 40 members (n = 1838). To visualize these differences, principal coordinates analysis (PCO) was performed on FST matrices within the R programming environment using routines in the APE package. Genetic similarity was quantified as being equal to the value of FST subtracted from one. Values along the main diagonal, which represent the similarity of each population to itself, were calculated from the estimated genetic distance between two copies of the same sample by the formula n/(n−1).
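As an illustration of the ordination step described above, the following is a minimal Python sketch of classical principal coordinates analysis (Gower double-centring) applied to a small matrix of pairwise FST values. It is not the APE routine used in this study: the function name `pcoa`, the choice to treat FST directly as a distance and square it, and the four-population matrix are all assumptions made purely for illustration.

```python
import numpy as np

def pcoa(dist, n_axes=2):
    """Classical principal coordinates analysis of a symmetric distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Gower double-centring of the squared distances.
    centring = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centring @ (d ** 2) @ centring
    # Eigen-decomposition of the symmetric matrix, largest eigenvalues first.
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    scale = np.sqrt(np.clip(eigvals[:n_axes], 0.0, None))
    return eigvecs[:, :n_axes] * scale, eigvals

# Hypothetical pairwise FST values for four populations (not data from this study).
fst = np.array([
    [0.00, 0.01, 0.05, 0.06],
    [0.01, 0.00, 0.04, 0.05],
    [0.05, 0.04, 0.00, 0.01],
    [0.06, 0.05, 0.01, 0.00],
])

# Similarity as defined in the text: one minus FST (off-diagonal entries).
similarity = 1.0 - fst

coords, eigvals = pcoa(fst)
print(coords[:, 0])  # position of each population on the first axis
```

In this toy matrix the first two and last two populations fall on opposite sides of the first axis, which is the kind of separation that Figure 2 displays for the real groups.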

Figure 1.

Distribution of UGT1A1 (TA)n rs8175347 genotypes categorized as low ((TA)7/7, (TA)7/8, or (TA)8/8), intermediate ((TA)6/7 or (TA)6/8), and high ((TA)5 or homozygous (TA)6/6) expression genotypes across countries and country subgroups. See Table 1 for full details of groups. Bantu is short for Bantu language speakers. Note that the frequencies of the low-activity alleles in the different country groups are significantly higher in the equatorial belt (+10 to −10 latitude) than elsewhere (p = 0.000015, Student's t-test), and also significantly higher in the equatorial belt than in the rest of Africa (p = 0.00019). Note also the interethnic differences, particularly in Ethiopia.
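The recoding of (TA)n genotypes into the three expression classes used in Figure 1 can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `expression_class` and the example genotypes are hypothetical, and treating any genotype that carries (TA)5 as "high" is an assumed reading of the caption.

```python
from collections import Counter

def expression_class(allele_a, allele_b):
    """Map a (TA)n genotype, given as two repeat counts, to an expression class."""
    genotype = tuple(sorted((allele_a, allele_b)))
    # Assumption: any genotype carrying (TA)5 is counted as high expression.
    if 5 in genotype or genotype == (6, 6):
        return "high"
    if genotype in {(6, 7), (6, 8)}:
        return "intermediate"
    if genotype in {(7, 7), (7, 8), (8, 8)}:
        return "low"
    raise ValueError(f"unexpected genotype: {genotype}")

# Hypothetical genotypes, not data from this study.
genotypes = [(6, 6), (6, 7), (7, 7), (5, 7), (7, 8), (8, 6)]
counts = Counter(expression_class(a, b) for a, b in genotypes)
print(counts)
```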

The D′ measure of LD between the three genotyped loci was calculated using LDMax, which uses the expectation-maximization algorithm to determine phase and is available as part of the GOLD software package. Haplotypes were inferred using PHASE v2.1.1 (100 iterations; 500 burn-in). The resulting haplotype frequencies were used to calculate Nei's gene diversity index (h) and population differentiation using exact tests (Markov chain length 100,000 steps). Where appropriate, the standard Bonferroni correction for multiple testing was applied by multiplying the significance value by the number of comparisons.
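For readers unfamiliar with D′, the following minimal Python sketch computes Lewontin's |D′| for a pair of biallelic loci from a known haplotype frequency. It is not the LDMax implementation: it assumes that phase is already known (LDMax estimates it by expectation-maximization), it covers only the two-allele case (the multiallelic (TA)n locus would need an extended formula), and the function name and the example frequencies are hypothetical.

```python
def d_prime(p_ab, p_a, p_b):
    """Lewontin's |D'| for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else 0.0

# Hypothetical frequencies: A and B always travel together (complete LD).
complete = d_prime(p_ab=0.4, p_a=0.4, p_b=0.4)
# Hypothetical frequencies: A and B are independent (no LD).
independent = d_prime(p_ab=0.16, p_a=0.4, p_b=0.4)
print(complete, independent)
```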


UGT1A1 (TA)n Allele Frequencies

The allele frequencies of UGT1A1 (TA)n are presented in Table 1 (see Table S3 for genotype frequency data). The allele frequency of (TA)7 ranged from 0.32 in Yemen and the Chewa of Malawi to 0.60 in the Anuak of Ethiopia. In Tanzania, Uganda, southern Sudan, Nigeria, Ethiopian Anuak, and all ethnic groups in Cameroon and Ghana, (TA)7 is the most common variant. The (TA)5 and (TA)8 alleles, which were not detected in the British sample, were present at low frequencies in all of the other groups tested. Overall, the (TA)5 allele was more prevalent than (TA)8 and reached a frequency of 0.10 or above in five of the 13 sub-Saharan African countries. The Ethiopian Anuak was the only sub-Saharan ethnic group without a single occurrence of (TA)5. Although this is a dinucleotide repeat (or microsatellite) polymorphism, no novel alleles were identified.

Table 1.  Allele frequency (≥1%) by country and country subgroup (based on self-declared cultural identity/ethnic group or language group) of (TA)n and the two SNPs, rs10929302 and rs11692021.

The geographic and ethnic distribution of inferred low-, intermediate-, and high-expression phenotypes based on recoded genotype data is presented in Figure 1. The distribution shows that low-activity genotypes are highly prevalent in equatorial regions of Africa and that Ethiopia has the highest within-country interethnic variability.

Pairwise FST results and associated p-values are shown in supplementary Tables S4A–C. The pairwise FST values show significant differentiation between sub-Saharan African regions and regions outside of sub-Saharan Africa (though for CEA, statistical significance did not remain after Bonferroni correction). However, there was little differentiation between countries within regions or between ethnic groups within countries in most cases. The exceptions were Senegal in the WA region and the Ethiopian ethnic groups. A PCO plot derived from pairwise FST measurements between all the distinct ethnic/language groups shows that while the SEA groups cluster, the CEA and WA groups are more differentiated (Fig. 2). The increasing values on the first principal component axis broadly correspond to increasing (TA)7 frequencies.
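The Bonferroni correction referred to here, described in Data Analyses as multiplying the significance value by the number of comparisons, can be sketched as follows. The function name, the cap at 1, the 0.05 threshold, and the p-values are assumptions for illustration and are not taken from the supplementary tables.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical raw p-values from three pairwise comparisons.
raw = [0.01, 0.04, 0.5]
adjusted = bonferroni(raw)
# Assumed significance threshold of 0.05 after correction.
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

In this toy example the second comparison is nominally significant before correction but not after it, which is the situation reported above for CEA.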

Figure 2.

Principal coordinates plot of the pairwise FST values for the country subgroups, calculated using UGT1A1 (TA)n frequency data. NE = Northern Europe; ME = Middle East; NA = North Africa; WA = West Africa; CEA = Central East Africa; SEA = South East Africa. See Table 1 for full details of groups. Bantu is short for Bantu language speakers. See supplementary Tables S4A–C for the pairwise FST data and p-values. This plot shows the clustering of the SEA groups, which contrasts with the much greater genetic distances between the CEA groups.

Analysis of the Two SNPs, rs11692021 and rs10929302

The allele frequencies of the two SNPs by country and ethnic group are shown in Table 1. The globally minor allele of the UGT1A7 nonsynonymous SNP rs11692021 was at highest frequency in the countries and individual ethnic groups in the CEA region (range: 0.33–0.53), at relatively lower frequency in SEA (range: 0.15–0.29) and at intermediate frequency in WA and the regions outside of sub-Saharan Africa (range: 0.21–0.41). A similar pattern was seen with UGT1A1 rs10929302, though the differences were less marked.

Variability of LD in Different Countries and Ethnic Groups

Pairwise D′ values, which give a measure of recombination, were calculated using data from countries and ethnic groups separately. There were distinct differences in the patterns of LD in each of the groups (see supplementary Table S5 for D′ values). Samples from the countries outside of sub-Saharan Africa showed the highest LD, with D′ of greater than 0.92 between the (TA)n and the UGT1A1 PBREM SNP rs10929302. The CEA region showed the lowest level of LD. Within Ethiopia, significant LD was detected across all three pairs of loci in the Anuak but for none in the Oromo.

Haplotype frequencies estimated using PHASE are presented in Table 2. Very similar frequencies were obtained using the expectation-maximization algorithm (data not shown). The haplotype frequencies and estimated diversity indices (see supplementary Fig. S1 for Nei's h values) show that haplotypic diversity is greater in sub-Saharan Africa. The haplotype encompassing all three “high” activity alleles (TG6: ancestral T allele of rs11692021 and the G allele of rs10929302 together with (TA)6) is the most prevalent in all groups except for the Ethiopian Anuak (where the low-activity haplotype CA7 is slightly more frequent). The only other major haplotype background for (TA)6 was CG6. Overall, (TA)7 occurs most frequently as part of haplotype CA7 (derived C allele of rs11692021 and the derived A allele of rs10929302) but has a more diverse haplotypic background in the sub-Saharan groups, where the frequencies of TA7 and TG7 were found to be over 0.10 in many instances. Although (TA)8 is relatively rare, its haplotypic background appears the most variable, while the other rare allele, (TA)5, has only one major background (TG5), again a combination with the ancestral SNP alleles (see Table S7 for comments on ancestral alleles). Exact tests of population differentiation using haplotype frequencies (see supplementary Table S6 for p-values) show the most genetic differentiation between the Europeans (British and Turkish) and all others, and also between the Ethiopian Amhara and Oromo and all others (including the Ethiopian Anuak).
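Nei's gene diversity index, used above to compare haplotypic diversity between groups, has the standard form h = n/(n − 1) × (1 − Σ p_i²), where p_i is the frequency of haplotype i and n is the number of chromosomes sampled. A minimal Python sketch follows; the function name is hypothetical, and although the haplotype labels follow the naming used in this paper, the frequencies and sample size are invented for illustration and are not values from Table 2.

```python
def nei_gene_diversity(haplotype_freqs, n_chromosomes):
    """Nei's unbiased gene diversity h for one population sample."""
    total = sum(haplotype_freqs)
    # Normalise in case the supplied frequencies do not sum exactly to one.
    probs = [f / total for f in haplotype_freqs]
    homozygosity = sum(p * p for p in probs)
    n = n_chromosomes
    return n / (n - 1) * (1.0 - homozygosity)

# Hypothetical haplotype frequencies for a sample of 200 chromosomes.
freqs = {"TG6": 0.45, "CA7": 0.30, "CG6": 0.10, "TA7": 0.08, "TG5": 0.07}
h = nei_gene_diversity(list(freqs.values()), n_chromosomes=200)
print(round(h, 3))
```

A sample dominated by one haplotype gives h close to 0, whereas many haplotypes at similar frequencies push h towards 1.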

Table 2.  Estimated haplotype frequency (>1%) by country and country sub-group.

Resequencing of the UGT1A1 Promoter

As a pilot study to check the promoter sequence context of the low-activity (TA)n alleles in Africans, the sequence of the immediate UGT1A1 promoter region (∼−330 from the (TA)n sequence to ∼+100 of the (TA)n sequence) was scanned in a total of 372 samples. No novel variation was identified, but the previously reported rs34547608 and rs887829 were found and typed in all 372 individuals. These SNPs do not significantly increase the haplotypic diversity: the rs34547608 C allele is very tightly associated with the (TA)5 allele (confirming previous reports based on data from 101 African–Americans (Beutler et al., 1998)), and the rs887829 T allele is tightly associated with (TA)7 and also (TA)8. Inferred three-locus haplotype frequencies are reported in Table 3, and five-locus haplotypes in supplementary Table S7.

Table 3.  Haplotypes comprising the three promoter loci (from left rs887829, rs34547608, and (TA)n) inferred using PHASE for a subset of African samples (n = 372).

Promoter haplotype    NA      CEA     CEA     CEA     WA      SEA     Frequency
CC5                   -       -       -       0.009   0.037   0.156   0.042
CT7                   -       -       0.008   0.028   -       0.006   0.007
TT6                   0.008   -       -       -       -       0.006   0.003
CC6                   -       -       -       -       -       0.006   0.001
CT5                   -       -       -       -       0.006   -       0.001
CT8                   -       -       -       -       -       0.006   0.001
TT5                   -       -       -       -       -       0.006   0.001
Total chromosomes     130     50      138     108     164     154

1. Haplotypes are named according to the allele composition. See Table S3 for details of the extended haplotypes and the methods section for details of samples. In the vast majority of cases (98.7%), the C allele of rs887829 is found with (TA)5 or (TA)6 and the T allele is found with (TA)7 or (TA)8.


In this paper, we confirm previous observations that the promoter variant (TA)7 of UGT1A1, which is associated with reduced UGT1A1 activity, hyperbilirubinaemia, and specific ADRs, occurs in a region of strong LD in non-African populations. However, in Africa, where there are also more (TA)n alleles ((TA)5, (TA)6, (TA)7, and (TA)8), we show more heterogeneity of haplotype background as well as large differences in the frequency of the alleles in different regions. Overall there is a geographic trend: the low-activity (TA)n genotypes are more prevalent in the equatorial regions, where haplotype diversity is also greater.

There are also differences between ethnic groups within a single country, and these are statistically significant in the case of Ethiopia. The Anuak show much higher frequencies of the (TA)7 allele and the low-activity haplotype, while the Oromo show a very high level of haplotype diversity. These distributions may simply reflect demography but it is interesting to note an apparent correspondence to the distribution of malaria. For example, the Ethiopian Anuak with the highest frequencies of (TA)7/8 live in the low-lying western regions around Gambella, where malaria is endemic, whereas the Amhara and Oromo, with lower frequencies, live in the eastern highland regions where malaria is infrequent or absent. It is noteworthy that high levels of unconjugated bilirubin can inhibit P. falciparum replication, suggesting that low UGT1A1 activity may possibly have conferred a selective advantage by protection from malaria, similar to other genetic traits such as glucose-6-phosphate dehydrogenase (G6PD) deficiency and sickle cell anaemia (Kumar et al., 2008). Others have noted that high frequencies of the (TA)7 allele occur in other areas where malaria is endemic such as much of the Indian subcontinent (Premawardhena et al., 2003).

For drugs such as irinotecan, a combination of (TA)7 and functional SNPs in other UGT1A isoforms, such as UGT1A7, has been proposed to be a better predictor of drug toxicity (Lankisch et al., 2008). In European populations, however, these variants coexist on the same haplotype (CA7), so that the whole haplotype is predictive of risk and it is hard to separate the effects of the TATA box variation from those of other functional SNPs. In the African populations studied here the situation is quite different, and recombination has separated the low-activity alleles. In the TA7 haplotype, for example (frequency 0.22 in the Tanzanian sample), the low-expression (TA)7 allele is on the same chromosome as the high-activity UGT1A7 allele, while the converse is true of the CG6 haplotype, which is frequent in the Ghanaian Bulsa. When, in 2004, the FDA approved a commercial test to predict a potentially fatal response to irinotecan therapy, they did not consider the complexity of the possible interaction with other functional SNPs in the UGT1A1 regulatory elements and within other UGT1A isoforms, which are predicted to lead to intermediate phenotypes. Thus many African–Americans may be prescribed doses of the drug based on advice that might not be relevant for people of all ancestries.

The results described in this paper, in particular the evidence of greater haplotype diversity across the UGT1A complex in sub-Saharan Africa than in Europe, emphasise the need for further investigation of the effect of the other functional UGT1A variants, in addition to (TA)n, on the risk of hyperbilirubinaemia due to interactions with hemolytic anaemia or treatment with HIV protease inhibitors, as well as on the risk of irinotecan toxicity. In addition, further resequencing of the rest of the UGT1A gene complex in people of African ancestry is indicated. Our pilot resequencing of the UGT1A1 promoter in a diverse sample set, however, failed to identify novel SNPs, and typing of the previously reported rs887829 and rs34547608 showed that there is very little recombination with (TA)n, so that even if these alleles modulate the function of (TA)n, the effect would be seen only in rare individuals. The haplotype and ethnicity information reported here will help in the construction of appropriate phenotype–genotype association studies and the development of better diagnostic tests. As well as testing other functional variants, it seems reasonable to suggest that rs887829 might be useful diagnostically as a marker for (TA)7 and (TA)8, since it would be easier to incorporate into multiplex assays than the microsatellite. This would also provide a solution to the problem that the (TA)8 allele cannot be typed in the commercial assay (Invader® UGT1A1 molecular assay package insert) despite its importance as a risk allele.


We thank all the sample donors, and also the DNA collectors: Leila Laredj, Matthew Forka, Liz Caldwell, M. le Roux, Pieta Näsänen, Tudor Parfitt, Tankei Helenius, Dr. Fouad Berrada, Esther William, D. Gomis, H. Babiker, J. Course, Hicram, James Wilson; Ranji Arasaretnam, Mari Wyn Burley, Heather Elding, and Anke Liebert for help with electrophoresis and sequencing; the Melford Charitable Trust for providing funding; Dr. Stephen Pereira for helpful discussion.