*Current addresses: MRC Prion Unit, University College London Institute of Neurology, Queen Square House, Queen Square, London WC1N 3BG, UK
**Institute of Genetics, School of Medicine, Queen's Medical Centre, University of Nottingham, Nottingham NG7 2UH, UK
***Molecular Endocrinology Group, Faculty of Medicine, Imperial College London, Clinical Research Building, Hammersmith Hospital, Du Cane Road, London W12 0NN, UK
Expression of lactase in the intestine persists into adult life in some people and not others, and this is due to a cis-acting regulatory polymorphism. Previous data indicated that a mutation leading to lactase persistence had occurred on the background of a 60 kb 11-site LCT haplotype known as A (Hollox et al. 2001). Recent studies reported a 100% correlation of lactase persistence with the presence of the T allele at a CT SNP at −14 kb from LCT, in individuals of Finnish origin, suggesting that this SNP may be causal of the lactase persistence polymorphism, and also reported a very tight association with a second SNP (GA –22 kb) (Enattah et al. 2002). Here we report the existence of a one megabase stretch of linkage disequilibrium in the region of LCT and show that the –14 kb T allele and the –22 kb A allele both occur on the background of a very extended A haplotype. In a series of Finnish individuals we found a strong correlation (40/41 people) between lactose digestion and the presence of the T allele. The T allele was present in all 36 lactase persistent individuals from the UK (phenotyped by enzyme assay) studied, 31/36 of whom were of Northern European ancestry, but not in 11 non-persistent individuals who were mainly of non-UK ancestry. However, the CT heterozygotes did not show intermediate lactase enzyme activity, unlike those previously phenotyped by determining allelic transcript expression. Furthermore, the one lactase persistent homozygote, identified by having equally high expression of A and B haplotype transcripts, was heterozygous for CT at the −14 kb site. SNP analysis across the 1 megabase region in this person showed no evidence of recombination on either chromosome between the –14 kb SNP and LCT. The combined data show that although the –14 kb CT SNP is an excellent candidate for the cause of the lactase persistence polymorphism, linkage disequilibrium extends far beyond the region searched so far.
In addition, the CT SNP does not, on its own, explain all the variation in expression of LCT, suggesting the possibility of genetic heterogeneity.
Intestinal lactase activity persists into adult life in some people but not others, and the molecular basis of this genetic trait is not understood (Swallow & Hollox, 2000). Lactase persistence is a monogenic trait which is inherited in an autosomal dominant manner. Mono-allelic expression of lactase transcripts in heterozygotes (Wang et al. 1998, 1995), and more recently family studies (Enattah et al. 2002), demonstrated clearly that the polymorphism was controlled by a cis-acting mechanism – i.e. the causative polymorphism was in a cis-acting regulatory element near the lactase gene. This was consistent with early studies that reported a trimodal distribution of lactase activity in jejunal samples from unrelated UK or German adults (Flatz 1984; Ho et al. 1982). Our population genetic data indicated that a mutation leading to lactase persistence had occurred on the background of a particular haplotype of the gene (Harvey et al. 1998; Hollox et al. 2001). This 60 kb 11-site haplotype (A) covering the lactase gene (LCT) is extremely common in Northern Europe where lactase persistence is also common, and lactase persistence is associated with the A haplotype (Harvey et al. 1998), although occasionally lactase persistence was found in combination with a non-A haplotype (Harvey et al. 1998; Wang et al. 1995). A recent study reported a complete association of lactase persistence with the presence of the T allele at a CT SNP at −14 kb from LCT, and suggested that this SNP may be causal of the lactase persistence polymorphism (Enattah et al. 2002). A second SNP at −22 kb was also highly associated. The study was conducted mainly in Finnish individuals, and only 5 samples from persistent non-Finnish individuals were tested. 
The aims of our study were to determine: the extent of linkage disequilibrium upstream from LCT; whether the T allele at –14 kb and the A allele at –22 kb were mutations uniquely on the background of the A haplotype; and whether 100% correlation of the –14 kb SNP could be found in our series of samples.
A contig centred on LCT was constructed by chromosome walking using three high-density gridded libraries distributed via the UK HGMP Resource Centre. These were the chromosome 2 specific Lawrence Livermore National Laboratory fosmid library LL02NC03 (average insert size 35–40 kb) and a cosmid library LL02NC02 (average insert size 40 kb) and the PAC library LL02NP04 (85 kb), and the whole genome PAC library RPCI1 (110 kb) produced by the Roswell Park Cancer Institute. Single copy probes from D2S442 and LCT were used to screen the libraries, and single copy sequence was generated from both ends of positive clones. This sequence was used to generate single-copy PCR products with which to rescreen the libraries, and chromosome walking was continued using this approach. Mapping the ends of D2S442 positive clones on to previously mapped YACs (Jarvela et al. 1998) oriented them with respect to LCT, and further orientations were deduced from the patterns of positive and negative signal obtained by PCR amplification of single copy sequence using other clones as templates. BAC clones from the RPCI-11 library, arranged in contigs and sequenced to first draft standard at the Genome Sequencing Center at the University of Washington Medical School at St Louis, were used to extend the map and particularly to identify SNPs on the other side of LCT.
All hybridisations using 32P-labelled probes were performed at 65°C in 6×SSC (20×SSC = 3 M NaCl, 0.3 M sodium citrate, pH 7.0), 1× Denhardt's solution (100× = 2% [w/v] Ficoll, 2% [w/v] polyvinylpyrrolidone and 2% [w/v] bovine serum albumin, pH 7.2) and 50 μg/ml of sonicated herring sperm DNA. The filter membranes were washed to a final stringency of 0.2×SSC, 0.1% SDS at 65°C and subjected to autoradiography using standard procedures.
Clone DNA was isolated with Qiagen Plasmid Kits (Qiagen) using the very low copy protocol. PCR was performed on either 1μl of boiled bacterial culture or 100ng of genomic DNA with 25 picomoles of each primer, 0.2 mM dNTPs, 75 mM Tris-HCl, pH 9.0 at 25°C, 20 mM (NH4)2SO4, 1.5 mM MgCl2, 0.1%[w/v] Tween and 1.25 U Taq Polymerase (ABgen) annealing at 57-59°C for 30 sec and extension at 72°C for 1 min for 32 cycles. PCR products were analysed on 2% agarose and detected by ethidium bromide staining. Direct double-stranded sequencing of clone DNA (∼2μg) with vector primers used the Thermo Sequenase Radiolabelled Terminator Cycle Sequencing Kit (Amersham).
As sequences were obtained they were analysed for the presence of repeat motifs and unique sequences on the databases through the GCG Wisconsin Package of programs at the HGMP Resource Centre, Hinxton Hall Cambridge, UK and programs available through the NCBI website (http://www.ncbi.nlm.nih.gov/).
The French families from the CEPH series (Dausset et al. 1990) were studied in detail. The maximum possible number of extended haplotypes was generated from each family by analysis of parents and selected children and/or grandparents. A few other CEPH samples were used in order to further characterise extended C haplotype chromosomes, which were rare in the French population. These were from individuals 1334: 10, 11, 12 and 13, 1424: 11 and 12; and 1447: 9 and 10.
Lactose tolerance was tested in 42 unrelated healthy individuals by giving a 50 g lactose load after an overnight fast and assessing tolerance by three methods: breath hydrogen, urinary galactose and blood glucose (Peuhkuri et al. 1998). The cut-offs described previously were used (Peuhkuri et al. 1998). The “gold standard procedure” (all three measures taken, and diagnosis made on the basis of two or more concordant results) (Peuhkuri et al. 1998) was used in most cases. Breath hydrogen alone was measured in 13 of the earlier cases. Details of symptoms were also recorded. In one case two of the three results were borderline, so this individual was excluded from the study. Ethical approval for this study was obtained from the Joint Authority for the Hospital District of Helsinki and Uusimaa (HUS) Ethics Committee.
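The two-of-three concordance rule can be illustrated with a minimal sketch. The function name and the labels 'digester', 'maldigester' and 'borderline' are hypothetical, and the per-test cut-offs of Peuhkuri et al. (1998) are assumed to have been applied beforehand; this is not the procedure as actually coded by the authors.

```python
from collections import Counter

def diagnose(breath_hydrogen, urinary_galactose, blood_glucose):
    """Combine three per-test calls into one diagnosis.

    Each argument is one of 'digester', 'maldigester' or 'borderline'
    (hypothetical labels). A diagnosis is made only when at least two
    tests give the same definite call; otherwise 'unresolved' is returned.
    """
    calls = [breath_hydrogen, urinary_galactose, blood_glucose]
    # Borderline results carry no vote.
    votes = Counter(c for c in calls if c in ("digester", "maldigester"))
    for status, n in votes.items():
        if n >= 2:
            return status
    return "unresolved"

# Two concordant tests outvote a conflicting breath hydrogen result.
print(diagnose("maldigester", "digester", "digester"))
# Two borderline results leave the individual unresolved (excluded).
print(diagnose("borderline", "borderline", "digester"))
```

Under this rule an individual with two borderline results cannot be diagnosed, which corresponds to the one exclusion described above.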
A cohort of 48 UK adults, collected in University College Hospitals London (with ethics approval from UCL/UCLH Committee on the Ethics of Clinical Investigations) and used in our previous lactase studies, was also tested (Harvey et al. 1995a; Wang et al. 1995). It should be noted that only cases showing normal villous architecture and immunohistology for sucrase-isomaltase and alkaline phosphatase were included in this cohort. The lactase persistence status of these individuals was determined directly by assay of lactase and sucrase and measuring the sucrase/lactase (S/L) ratio. Eleven of the individuals were diagnosed as non-persistent (S/L ratio of more than 10) and one individual gave an ambiguous result (ratio of 7.7). 18/48 individuals were heterozygous for exonic SNPs (9 lactase non-persistent and 9 lactase persistent) and the relative level of expression of the lactase transcripts could thus be determined. 8/9 of these informative lactase persistent individuals were diagnosed as heterozygous for lactase persistence, since they showed high expression of one allele and low expression of the other. In one case both transcripts were expressed at equal levels, which was considered diagnostic of homozygosity for lactase persistence. The remainder showed low expression of both transcripts, and were lactase non-persistent. Five of the individuals (three persistent and two non-persistent), who are heterozygous for exonic polymorphisms (and thus of known lactase persistence genotype), were selected from this cohort and constituted the panel of samples for sequencing and SNP searching. Genomic DNA used was from lymphoblastoid cell lines derived from the peripheral blood of these individuals and/or DNA extracted from the biopsies. Critical results were confirmed on biopsy DNA.
SNP Identification and Typing
PCR products corresponding to clone end sequences were generated from genomic DNA from the panel of the five individuals of known lactase persistence genotype, and directly sequenced to identify SNPs. These, and two recently published SNPs at –14 kb and –22 kb (Enattah et al. 2002), were typed on further samples using a variety of methods: Restriction Fragment Length Polymorphism (RFLP) analysis, RFLP generated by primer design (Thomas et al. 1999), Tetra ARMS PCR (Ye et al. 2001); and direct sequencing of PCR products. The approximate location of the SNPs was determined using average clone size information and electronically published but unfinished sequence data. The sites of the SNPs in relation to the latest Golden Path sequence, as found on the June 2002 freeze of the Draft Human Genome Browser (http://genome.cse.ucsc.edu/), and the flanking sequences can be found on our web site (http://www.gene.ucl.ac.uk/mucin/). Throughout this study we have interpreted the observed phenotypes at each locus as genotypes, making the assumption that there are no silent/deleted alleles; hence C, CT and T are written and interpreted as CC, CT and TT respectively. The insertion/deletion polymorphism in intron 1 (INDEL intron 1) was typed using two separate PCRs, one for each allele (Figure 1), each reaction including control PCRs of similar size to check for DNA and PCR quality. The primers used for the long allele PCR were: 5′GTGGAATGTGAAACGGATCC3′ (LS-F) and 5′AGGACCATATGGCTGTCTTC3′ (LS-R), product size 244bp, and were both located in the deleted sequence. For the short allele PCR the primers were: 5′CTAGGACATCATAGCTGCCT3′ (SS-F), and 5′CTCTGACTGTGGAAACCACTG3′ (SS-R), product size 944 bp. The longer product size used for the short allele PCR was needed to avoid repeat elements. 
For both PCRs the conditions were: initial denaturation at 95°C for 5 minutes, followed by 32 cycles of denaturation at 95°C for 30 seconds, annealing at 59°C for 30 seconds, and extension at 72°C for 2 minutes for the short allele PCR, and for 1 min. for the long allele PCR. Details of the primers for the control products included in the PCR are given on our website (http://www.gene.ucl.ac.uk/mucin/).
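How the two allele-specific reactions combine into a genotype can be sketched as follows. The function and argument names are hypothetical; the sketch assumes that the 244 bp product appears only when an L allele is present, that the 944 bp product appears only when an S allele is present, and that there are no silent alleles.

```python
def call_indel_genotype(long_product, short_product,
                        long_control=True, short_control=True):
    """Call the INDEL intron 1 genotype from two allele-specific PCRs.

    long_product  -- True if the 244 bp long-allele product was seen
    short_product -- True if the 944 bp short-allele product was seen
    *_control     -- True if the control product amplified in that reaction
    Returns 'LL', 'LS', 'SS', or None when no call can be made.
    """
    # A reaction without its control band says nothing about the allele.
    if not (long_control and short_control):
        return None
    if long_product and short_product:
        return "LS"
    if long_product:
        return "LL"
    if short_product:
        return "SS"
    # Neither allele-specific product despite working controls.
    return None

print(call_indel_genotype(True, False))   # long product only
print(call_indel_genotype(True, True))    # both products
```

Requiring both control products before making a call reflects the role of the control PCRs as a check on DNA and PCR quality.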
Haplotypes across the whole region were deduced by segregation, from the CEPH family data in a total of 64 chromosomes by testing sufficient parents and/or children. For the unrelated Finnish chromosomes haplotypes were deduced by using information obtained from the homozygotes, comparison with the CEPH haplotypes, and assuming the minimum number of historic recombination events.
Linkage Disequilibrium Analysis
Pairwise linkage disequilibrium between combinations of the loci was estimated using the 48 CEPH chromosomes of French origin. Linkage disequilibrium was measured using the normalised coefficient D′, which takes values from 0 for complete equilibrium to 1 for complete disequilibrium. D′ was calculated with the computer program HaploXT (http://www.sph.umich.edu/csg/abecasis/GOLD/docs/haploxt.html) as $D' = |D_{ij}/D_{\max}|$, where $D_{ij} = p_{ij} - p_i p_j$ and $D_{\max} = \min(p_i p_j,\ (1-p_i)(1-p_j))$ if $D_{ij} < 0$, or $D_{\max} = \min((1-p_i)p_j,\ p_i(1-p_j))$ if $D_{ij} > 0$. Here $p_i$ and $p_j$ are the frequencies of alleles i and j, and $p_{ij}$ is the frequency of the haplotype having i at the first locus and j at the second locus. Haplotypes of the unrelated Finnish individuals were deduced by consideration of the allele combinations found in the homozygous individuals and in the CEPH samples, and assuming the minimum number of ancestral recombinations.
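For readers who wish to check the definition of D′ numerically, a minimal sketch follows. It is not the HaploXT implementation; the function names are hypothetical and the haplotype counts in the example are invented for illustration.

```python
def d_prime(p_ij, p_i, p_j):
    """Normalised disequilibrium D' from the haplotype frequency p_ij
    and the allele frequencies p_i and p_j."""
    d = p_ij - p_i * p_j
    if d < 0:
        d_max = min(p_i * p_j, (1 - p_i) * (1 - p_j))
    else:
        d_max = min((1 - p_i) * p_j, p_i * (1 - p_j))
    # A monomorphic locus gives d_max = 0; treat it as no disequilibrium.
    if d_max == 0:
        return 0.0
    return abs(d / d_max)

def d_prime_from_haplotypes(haplotypes, allele_i, allele_j):
    """Estimate D' from a list of phased two-locus haplotypes,
    each given as a tuple (allele at locus 1, allele at locus 2)."""
    n = len(haplotypes)
    p_i = sum(1 for a, _ in haplotypes if a == allele_i) / n
    p_j = sum(1 for _, b in haplotypes if b == allele_j) / n
    p_ij = sum(1 for a, b in haplotypes
               if a == allele_i and b == allele_j) / n
    return d_prime(p_ij, p_i, p_j)

# Invented example: 48 chromosomes in which T and A always co-occur.
haps = [("T", "A")] * 20 + [("C", "G")] * 28
print(d_prime_from_haplotypes(haps, "T", "A"))
```

When the two alleles always occur together, as in the invented example, the estimate approaches 1; when all four haplotypes are equally frequent it is 0.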
Pairwise association of alleles at each of the SNP loci with lactose digestion status was analysed by Fisher's exact test (1 sided p-values).
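A one-sided Fisher's exact test on a 2×2 table can be sketched directly from the hypergeometric distribution. The function name and the layout of the table are assumptions made for illustration, and the counts in the example are invented rather than taken from the study.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact p-value for the 2x2 table

                     allele present   allele absent
        digester          a                b
        non-digester      c                d

    Returns the probability, with all margins fixed, of observing a
    or more digesters carrying the allele.
    """
    row1 = a + b           # number of digesters
    col1 = a + c           # number of allele carriers
    n = a + b + c + d
    denom = comb(n, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / denom

# Invented counts, for illustration only.
print(fisher_one_sided(3, 0, 0, 3))
```

The tail is summed in the direction of positive association, which is the direction of interest when asking whether an allele is enriched among digesters.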
Other Statistical Tests
The distributions of activities in individuals of different genotype were compared using Student's t-test and the Mann-Whitney rank test, and the distributions of the activity ratios using the Mann-Whitney rank test.
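As an illustration of the rank-based comparison, the Mann-Whitney U statistics for two groups can be computed as in the sketch below. Only the statistic is computed, not the p-value; the function name is hypothetical and the values in the example are invented, not measured activities.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y), the Mann-Whitney U statistics for samples
    x and y, giving tied values the average of their ranks."""
    combined = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    rank_sum_x = 0.0
    i = 0
    while i < len(combined):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        # Ranks i+1 .. j share their average.
        avg_rank = (i + 1 + j) / 2
        n_x_in_run = sum(1 for k in range(i, j) if combined[k][1] == 0)
        rank_sum_x += avg_rank * n_x_in_run
        i = j
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y

# Invented values: complete separation of the two groups.
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))
```

A U value of 0 for one group indicates complete separation of the two distributions, while values near half the product of the group sizes indicate extensive overlap.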
Determination of Allelic Expression
Relative mRNA transcript levels were determined using PCR products obtained by RT-PCR performed under the general conditions described previously (Wang et al. 1994) with appropriate controls and ‘no RT’ blanks. The primers for the RT-PCR are located in exons 1 and 3 of LCT, and 30 cycles of amplification was selected to produce sufficient template DNA for sequencing. The transcripts were distinguished using exonic SNPs at nt 593 and nt 666 and quantification by phosphorimage analysis of 33P sequencing gels, as described previously (Wang et al. 1998).
Construction of Extended Lactase Haplotypes and Association Analyses
Sequence 5′ and 3′ of the lactase gene:
Initially 10 kb 5′ of exon 1 and 2.7 kb 3′ of exon 17 was sequenced using existing clones (Boll et al. 1991). Overlapping primer sets were designed on this sequence and used to produce amplicons from the panel of five samples with known lactase persistence genotypes. Nine new polymorphic sites were detected upstream of the sites we had already reported (Wang et al. 1998). In the five samples all of the polymorphisms were in complete association with alleles at one or other of the three previously published polymorphic sites that define the three major haplotypes (Harvey et al. 1995a). The new sites found are in agreement with those recently reported (Enattah et al. 2002), except that we found an additional site at approximately −7.7 kb. Nine new polymorphic sites were found at the 3′ side of the gene (7 SNPs and 2 microsatellites). Eight of these were also associated with the LCT haplotypes, while one SNP showed a novel allele in panel sample 4. Further analysis of this variant showed that it was unique to this individual.
Contig Construction and Distribution of Polymorphisms and Coding Sequences
A contig of 60 overlapping BAC, PAC, fosmid and cosmid clones was constructed that spanned approximately 1 Mb. The map generated is shown in Figure 2 and is in close agreement with that now available in the Draft Genome Browser (June 2002 freeze).
A total of 13 SNPs were identified; they are quite unevenly distributed. Although this is partly due to the distribution of the stretches of sequence screened, to a large measure this reflects their uneven occurrence in the region. In addition to LCT, six other genes were identified in this region. These were the cell cycle gene MCM6 (Harvey et al. 1996), aspartyl tRNA synthetase DARS (Escalante & Yang, 1993), CXCR4 (Wegner et al. 1998), UBXD2 (alias KIAA0242), and R3HDM (alias KIAA0029). This information is in agreement with the June 2002 Draft Human genome browser. Figure 2 shows the locations of genes, and positions of the SNPs.
A ‘first pass’ sequencing of LCT intron 1 was also conducted using clones of B and A haplotype. This revealed a 3.5 kb deletion difference between the two clones. Comparison with the sequence on the June 2002 Draft Human Genome Browser shows that this additional sequence inserts at position 134245902, to give the INDEL intron 1 L allele. Primers were designed to test for this deletion (Figure 1).
Initial Haplotype Analysis
Extended haplotypes of 48 chromosomes were determined from the CEPH families from Northern France. The LCT haplotypes of these chromosomes have been reported previously (Harvey et al. 1995a) and comprise 28 A, 13 B, and 3 C, together with 2 D, 1 E and 1 F. For the purposes of the haplotype analyses; the two D chromosomes were considered as B chromosomes, since D is derived from B by a point mutation at nucleotide –875 (Harvey et al. 1995a; Hollox et al. 2001); and the chromosome F also as B since they differ only at the exon 17 SNP T5579C. To these chromosomes were added chromosomes from 8 individuals from the CEPH Utah families of Northern European ancestry, which added a further 5 C chromosomes, 10 A chromosomes and 1 E. The total number of chromosomes was therefore 64. For this part of the analysis 18 SNPs were tested, numbered 1–18 in Figure 2.
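The chromosome totals can be checked with a short tally. The counts are those given above; the variable and function names are hypothetical.

```python
from collections import Counter

# LCT haplotype counts reported for the two sets of CEPH chromosomes.
french = Counter({"A": 28, "B": 13, "C": 3, "D": 2, "E": 1, "F": 1})
utah = Counter({"C": 5, "A": 10, "E": 1})

# D and F chromosomes are treated as B for the haplotype analyses.
MERGE_INTO_B = {"D", "F"}

def pooled_counts(*groups):
    """Pool haplotype counts across groups, folding D and F into B."""
    total = Counter()
    for group in groups:
        for haplotype, n in group.items():
            key = "B" if haplotype in MERGE_INTO_B else haplotype
            total[key] += n
    return total

counts = pooled_counts(french, utah)
print(dict(counts), sum(counts.values()))
```

The pooled totals of 38 A, 16 B and 8 C chromosomes (plus 2 E) are the denominators used for the extended haplotype counts.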
The majority of chromosomes showed the same few haplotypes previously seen over 60 kb, extended over very much longer distances. 36/64 chromosomes were identical by state according to haplotype (A, B or C) from around 420 kb upstream (extending to point 3) to 300 kb downstream of LCT (extending to point 18), except at positions 4 and 6. The SNPs at positions 4 and 6 can be seen to be superimposed on the B and A haplotype segments respectively, and probably represent more recent mutations, so these sites were ignored when determining the extended haplotypes. The numbers of A, B and C haplotypes extending from points 3 to 18 were thus 25/38 for A, 10/16 for B and 1/8 for C. In the case of the C haplotype the ancestral haplotype was deduced by aligning all 8 chromosomes and assuming that the most frequent distal alleles were part of the haplotype. Along these extended haplotypes the sites can be classed as ‘a’, ‘b’, or ‘c’, according to whether the nucleotide is only present on the A, B or C haplotype (Figure 3). Further upstream of point 3 the haplotypes become more mixed, and at the most 3′ site, site 18, C is present on all the A haplotype chromosomes while several of the B and C haplotype chromosomes are recombined at this point. The two E haplotype chromosomes support the proposed recombinant origin of E haplotype chromosomes (Hollox et al. 2001), the 5′ side representing the C haplotype and the 3′ side the A haplotype. Of the remaining 26 chromosomes, in 16 cases the pattern of alleles either side of the breakpoint was consistent with the idea that a simple recombination event has occurred which swapped the peripheral parts of the other common haplotypes (see our website for supplementary information).
The same full set of polymorphic sites was also tested in a series of 21 samples from unrelated individuals of Finnish ancestry, and haplotypes were deduced from analysis of the homozygotes as well as comparison with the CEPH chromosomes, assuming the minimum number of ancestral recombination events. 20/42 chromosomes showed the same extended haplotypes as in the CEPH individuals (11/23 A, 4/8 B, 5/10 C, 1 E).
Addition of Polymorphisms that Subdivide, and thus Post-Date, the A Haplotype
The insertion deletion polymorphism in intron 1 (INDEL intron 1) was also analysed in the CEPH families, and full haplotype analysis of this and the newly described –14 kb CT and –22 kb GA SNPs (Enattah et al. 2002) was undertaken for all of the 48 French CEPH chromosomes.
INDEL intron 1: The L allele was carried by all B and C haplotype chromosomes, while most of the A haplotype chromosomes carried S. However, two of the 26 A chromosomes tested in this series carried an L allele.
CT-14 kb GA-22 kb: Similarly, all non-A chromosomes carried –14 kbC and −22 kbG, and all the −14 kb T and −22 kb A alleles were carried on A haplotype chromosomes. The two A haplotype chromosomes that carried the INDEL intron 1 L allele (AL) also carried −14 kbC and −22 kbG, but two other A haplotype chromosomes with intron 1 S (AS) also carried −14 kbC and −22 kbG. In two cases, −14 kbC was present together with –22 kbA on an ‘AS’ haplotype chromosome. Interestingly none of the A haplotype −14 kb C and −22 kb G chromosomes carried A haplotype markers over the full distance between marker 3 and 18, unlike the T and A carrying chromosomes in which 18/22 were on the background of full length A haplotype chromosomes.
The low level of haplotype diversity was reflected in the patterns of linkage disequilibrium. Pairwise association of alleles and linkage disequilibrium (D′) were estimated between all of the informative sites, using the 48 CEPH chromosomes of French origin (Figure 4). It can be seen that there is significant LD across the entire region, there even being statistically significant LD (p = 0.01) between the first and last sites, which are separated by over one megabase. LD was not determined in the Finnish cohort since this population was selected artificially to contain more lactase non-persistent people than in the general population.
Allelic Association with Lactose Digestion (Lactase Persistence) Phenotype
For this part of the study we tested all 41 phenotyped Finnish individuals, but focussed on markers which discriminate A and non-A chromosomes, although all the haplotype defining markers within the gene were also tested. Three additional polymorphic sites located between –7.5 kb and –8.5 kb were also tested, since analysis of the original panel of five individuals had indicated that these also discriminate A and non-A chromosomes.
Each of the sites was assessed in relation to lactose digestion phenotype; phenotype was highly associated with several of the markers (Table 1). Consistent with observations on the extended haplotypes, statistically significant association can be seen between lactose digestion and site 3, which is located about 420 kb upstream of LCT (p = 0.00019). The highest association was however with the –14 kb CT SNP, in agreement with the recently reported results, but one individual gave a discrepant result: an apparent digester who does not carry a T at –14 kb (or an A at –22 kb). Examination of the detailed tolerance results from this individual showed conflicting results for one of the tests (blood glucose and urinary galactose both showed levels clearly in the ‘digester’ range, while the high breath hydrogen indicated maldigestion). The –22 kb marker was also very highly associated. In all but one case there was concordance between CT −14 kb and GA −22 kb. This one discordant individual (–14 kb CC and –22 kb GA) was non-persistent (based on three tests as well as presence of symptoms).
Table 1. Pairwise association, in a cohort of Finnish individuals, between lactose digestion status and allele counts in each group. Markers selected were those which show strong association with the A haplotype, except for markers 10, 13 and 11, which are defining markers for the B (b) and C (c) haplotypes. In the case of the A-associated markers the number of individuals who are ‘non-fits’, i.e. who carry the allele but are non-digesters, or do not carry the allele and are digesters, is also shown. The markers that were originally part of the 60 kb haplotype are indicated with an asterisk.
Allele frequency of rarer allele in sample tested
Number of individuals tested (Number of digesters)
Number of ‘non fits’(Allele+/non digester; Allele-/ digester)
Fisher's exact p value
9 (8; 1)
1.9 × 10−4
14 (13; 1)
8.2 × 10−3
2 (1; 1)
1 (0; 1)
11 (11; 0)
1.8 × 10−4
11 (11; 0)
5 × 10−5
13 (13; 0)
1.5 × 10−4
8.2 × 10−3
15 (15; 0)
8 × 10−5
12 (12; 0)
2 × 10−5
nt 666 (cDNA)
nt 5579 (cDNA)
16 (16; 0)
5 × 10−4
Examination of the 82 putative Finnish haplotypes reveals that 12 of the 22 A haplotype chromosomes which carry the C allele at –14 kb show non A haplotype markers immediately upstream of –14 kb. In five cases non A haplotype markers were also found downstream of –14 kb, suggesting that the surrounding chromosomal fragment is of non A haplotype origin. Four of these five chromosomes carried the ancestral L allele in intron 1. In just 6/22 cases the C allele was found on a fully extended A haplotype chromosome. This contrasts with the finding that 19/21 T carrying chromosomes were on the fully extended haplotype background, and is consistent with the observation in the CEPH families of shorter blocks of A haplotype in –14 kbC carrying chromosomes, and longer blocks in –14 kbT carrying chromosomes.
Analysis of CT-14 kb and GA-22 kb in a Series of 48 Individuals Phenotyped and Genotyped for Lactase Persistence
In previous studies we characterised the level of lactase and lactase mRNA transcripts in duodenal biopsies. The level of lactase activity measured as a ratio of sucrase to lactase (S/L) showed the trimodal distribution that had originally been part of the evidence that the polymorphism was cis-acting (Flatz, 1984; Ho et al. 1982; Wang et al. 1995) (Figure 5). Relative transcript levels were used to determine the lactase persistent genotype of all 17 individuals who were heterozygous for one or more SNPs (Wang et al. 1995). Five of these genotyped individuals (whose DNA had been used for SNP identification) were tested for the full range of markers, and their extended haplotypes can be deduced (Figure 3a). For all other individuals, only the two most highly associated SNPs (CT-14 kb and GA-22 kb) were tested, as well as SNPs across LCT, which were used to deduce the likely 60 kb LCT haplotypes. The results were assessed in relation to lactase persistence status as well as S/L ratio. The –14 kbT allele and the –22 kbA allele were found in one or two copies in all the persistent individuals (31/36 of whom were of UK ancestry), and were not present in any of the non-persistent individuals nor in the one person for whom the diagnosis was ambiguous (Table 2) (all but one of non-UK ancestry) (Harvey et al. 1995b; 1998). All non-A haplotype chromosomes carried a C at −14 kb and a G at –22 kb, and all –14 kbT alleles and –22 kbA alleles were associated with A haplotype chromosomes. All 8 individuals who had been shown to be heterozygous for lactase persistence by transcript expression were heterozygous for both SNPs.
Table 2. Lactase and sucrase activities in the series of London patients, grouped according to persistence and SNP status. Monoallelic or biallelic expression was used to interpret heterozygosity or homozygosity of the phenotypic trait of lactase persistence
SNP and Persistence status
1 Activities per min. per gram wet tissue. Note that protein determinations were not done, in order to conserve sufficient biopsy material for immunohistology, electrophoresis and RNA studies, and this probably accounts for the wider scatter of the data when expressed this way. 2 p < 0.03, Mann-Whitney and t test. 3 p = 0.006, 4 p = 0.06, Mann-Whitney test. There was no significant difference in the sucrase activities between any of the groups.
CC uncertain status
CT persistent known heterozygote
CT persistent unknown status
CT persistent homozygote
CT persistent, AA at –22 kb
However, comparison of the CT -14 kb and GA –22 kb results with the previously determined trimodal distribution of enzyme activity ratios gave an unexpected distribution. If the T allele is truly causal of high lactase expression in adult life, it would be expected that most of the homozygotes would fall in the low S/L ratio peak, while most of the CT heterozygotes would be in the intermediate peak, as had been observed previously for heterozygotes diagnosed from transcript expression studies (Wang et al. 1995). However, the enzyme activity ratios of the total CT heterozygote population overlapped dramatically with the TT population, and the overall difference between the two groups was not statistically significant (p = 0.085). This can be seen clearly in Figure 5. When individuals heterozygous for the –14 kb CT SNP were grouped according to known and unknown lactase persistence genotype status, based on RNA studies, the two groups differed from each other and it was clear that the CT ‘unknown status’ group did not show evidence of higher ratios (p = 0.76, Mann-Whitney) while the CT ‘known status’ group was significantly different from the TT homozygotes (p = 0.006, Mann-Whitney). The difference between the two CT groups was confirmed as being due to differences in the lactase rather than the sucrase, when the activities were compared separately (Table 2). Interestingly, the two lactase-persistent individuals who were –14 kb CT and –22 kb AA both showed the higher lactase activity (lower ratios) predicted for persistent homozygotes (Table 2, Figure 5).
The second very clear observation was that the one individual who was heterozygous AB haplotype and homozygous for lactase persistence, as assessed by transcript expression, was heterozygous for both CT and GA (Figure 5).
Allelic Imbalance Studies
Our original observation of balanced biallelic expression of lactase mRNA transcripts, in the UK adult whom we had interpreted as homozygous persistent, was made using exons 2 and 17, by visual inspection of gels. This evaluation was therefore reassessed for this study, using a new set of PCR primers to retest the exon 2 SNP (nt 666) and testing a third SNP in exon 1 at nt 563, which was included in the same PCR product. We also quantified the allelic expression by Phosphorimage analysis. Figure 6 shows the results of this experiment, and confirms that this individual shows high expression of the B transcript that carries a C at –14 kb and G at –22 kb, and that the high expression of this transcript cannot therefore be due to alleles at either of these sites.
In this study we have shown that LCT is located within a region of very high linkage disequilibrium, and many polymorphic sites show a high level of association with lactase persistence/lactose digestion. However, the association with CT −14 kb and GA −22 kb is high enough to consider them as very serious candidates for the causal mutational change. The SNPs at –14 kb and –22 kb clearly resulted from mutations on an A haplotype chromosome, C and G representing the ancestral alleles. The pattern of association of these two loci would suggest that the CT −14 kb SNP is due to a more recent mutation than GA −22 kb, and that they both occurred after the deletion event in intron 1. The shorter blocks of haplotype identity seen in the ancestral A haplotype chromosomes are consistent with the much greater relative age of the 60 kb A haplotype than that of the –14 kbT mutation.
Our analysis of the Finnish cohort, which shows 98% association of lactose digestion with T –14 kb, is consistent with this being causal of lactase persistence since careful examination of the different lactose digestion measures highlights the difficulties inherent in making this diagnosis. The one discrepant individual would have been differently classified (as a non-digester) if breath hydrogen alone had been used. It is however noteworthy that if the data are re-analysed using breath hydrogen alone, one other different individual becomes an exception. In contrast, the one additional exception with the GA–22 kb SNP tends to support the notion that this is not the causal change. This individual is unambiguously lactase non-persistent but carries an A allele, and is in agreement with the discrepant samples described by Enattah and colleagues (Enattah et al. 2002).
The data obtained from duodenal biopsies from individuals of non-Finnish origin also indicate that the T allele of the CT–14 kb SNP is highly associated with persistence in non-Finnish Europeans, and in all cases the data suggest that it occurs on the background of the A haplotype. The persistent samples we tested came mainly from Northern Europe, though 3 were from Southern Europe and 2 from the Indian sub-continent. However, the lactase activities do not correspond well with predicted genotype, except for those CT heterozygotes already known to be heterozygous for expression level who show intermediate activities. This indicates that the association of –14 kbT with expression level is not as high as in the Finnish population. If the T allele is causal of lactase persistence, it would appear that there is additional unseen heterogeneity. This could for example include the SNP at –958, which we have previously shown to affect DNA binding and which has been shown by others to alter LCT promoter function (Chitkara et al. 2001; Hollox et al. 1999).
Furthermore, we have previously shown that persistence is occasionally found on non-A haplotype chromosomes (Harvey et al. 1998). If presence of a T allele at the CT–14 kb SNP is causal of persistence, the most likely explanation for this would have been the occurrence of an ancestral recombination between this position and LCT. However, the high expressing B haplotype chromosome studied here in detail did not show evidence of any recombination in >420 kb upstream of LCT and importantly did not carry the T allele. Preliminary analysis of 36 samples from unrelated Italians described previously (Harvey et al. 1998) shows that while all 25 of the definite non-persistent individuals are homozygous CC, there are two other cases in which the T allele was not present on high expressing non-A haplotype chromosomes. Full analysis of the haplotype background of these samples is in progress.
In conclusion, if lactase persistence/non-persistence is due to a simple biallelic polymorphism, as originally thought, our data might suggest that another change is responsible for lactase persistence, and that this arose on an A haplotype chromosome after GA-22 kb and before CT-14 kb. Analysis of both the CEPH families and the Finnish samples shows that the three common European SNP haplotypes extend some 1000 kb in many Northern Europeans. This contrasts with the report of Enattah and colleagues (Enattah et al. 2002), who showed a 200 kb region of LD and whose ‘identity by state’ microsatellite analysis indicated a shared haplotype of only 47 kb carrying the lactase persistence allele, the region which they fully sequenced in genotyped individuals. The small size of this region may in part reflect the higher mutation rate of microsatellites, rather than ancient recombinations as suggested. For example, in their study only one microsatellite, D2S3013 (see Figures 1 & 2), located within intron 1 of LCT, breaks down the haplotypes of the persistent chromosomes at the 3′ end of the region, while two more 3′ microsatellites show the same allele in all persistent chromosomes tested. Microsatellites that we have tested in the immediate vicinity of the gene show remarkably little diversity beyond that of the SNPs (Hollox, Mulcare & Thomas, unpublished), and D2S3013 may simply have a higher mutation rate. This interpretation means that the causal element could reside far outside the 47 kb region.
However, the other possibility is that the C-14 kbT mutation is indeed causal, but is not the only cause of lactase persistence. The fact that the B haplotype chromosome with high lactase expression studied here in detail shows an extended B haplotype over more than 700 kb (sites 3–15) might point to a separate mutation with similar effect. It is also possible that this second genetic change resides in a trans-acting element, such as the gene encoding a putative transcription factor that binds to the –14 kb site, or even that a non-genetic cause has unusually allowed adult expression of this allele. In due course these possibilities should be resolved by functional studies, but this report illustrates the difficulty of linkage disequilibrium mapping of functional nucleotide changes in regions of high LD, where the phenotype is not trivial to determine, even when there is clear evidence of a monogenic trait.
Another important conclusion from this work is that the very extended LCT haplotypes (particularly the A haplotype carrying the T allele at –14 kb), and the exceptionally long region of LD, are consistent with the notion of recent selection (Sabeti et al. 2002; Slatkin, 2000; Slatkin & Bertorelle, 2001) and also with our previous population genetic analysis of a smaller LCT haplotype, which suggested a model of recent directional selection for lactase persistence (Hollox et al. 2001).
The authors are grateful to Nabila Mahfiche, Janki Shah, Fella Hammachi, Lupe Polanco and Helen Bond for their contributions to this work, to the Rank Prize summer studentship funds for supporting NM, FH and HB, and to Sue Povey and Mark Thomas for helpful discussion. This work was funded by the Medical Research Council through the MRC Human Biochemical Genetics Unit and the Digestive Disorders Foundation (CBH). CM is funded by a BBSRC CASE studentship.