Patterns of variation in DNA segments upstream of transcription start sites

It is likely that evolutionary differences among species are driven by sequence changes in regulatory regions. Likewise, polymorphisms in promoter regions may be responsible for interindividual differences at the level of populations. We present an unbiased survey of genetic variation in 2-kb segments upstream of the transcription start sites of 28 protein-coding genes, characterized in five population groups of different geographic origin. On average, we found 9.1 polymorphisms and 8.8 haplotypes per segment with corresponding nucleotide and haplotype diversities of 0.082% and 58%, respectively. We characterized these segments through different summary statistics, Hardy-Weinberg equilibria, fixation index (Fst) estimates, and neutrality tests, as well as by analyzing the distributions of haplotype allelic classes, introduced here to assess the departure from neutrality and examined by coalescent simulations under a simple population model, assuming recombinations or different demography. Our results suggest that genetic diversity in some of these regions could have been shaped by purifying selection and in others driven by adaptive changes, thus explaining the relatively large variance in the corresponding genetic diversity indices among loci. However, some of these effects could also be due to linkage with surrounding sequences, and the neutralists' explanations cannot be ruled out given uncertainty in the underlying demographic histories and the possibility of random effects due to the small size of the studied segments. Hum Mutat 28(5), 441–450, 2007. © 2007 Wiley-Liss, Inc.


INTRODUCTION
Patterns of DNA diversity in the human genome result from the stochastic nature of mutations and recombinations. They are further shaped by random genetic drift, by demographic history, and by natural selection. To understand the distinct contributions of the underlying genetic and evolutionary phenomena, we have to examine the genomic variation in individuals and populations from different geographic areas. So far, considerable effort has gone into the identification of polymorphisms that could be used as markers in linkage and association studies. Fewer studies have focused on DNA variability in particular genomic segments, and even fewer have investigated variation beyond two or three population groups. Likewise, much less attention has been given to DNA variants that could be classified as regulatory, in spite of their importance for understanding species evolution [King and Wilson, 1975], functional components of the genome [Carroll, 2005], and ultimately phenotypic diversity and complex traits [Pastinen and Hudson, 2004]. On the other hand, there is an inherent difficulty in defining DNA segments and their variants that can be qualified as regulatory, in contrast, for example, to easily qualifiable alterations in the protein-coding sequences. In order to describe genetic diversity in cis-regulatory regions and, in particular, DNA variation involved in the control of transcription of the protein-coding genes, we used an operational definition of the promoter region as a 2-kb segment upstream of the transcription start site [Kim et al., 2005]. Previously, we described DNA polymorphisms in 197 such regions surveyed for the presence of variants in a sample of 40 individuals representing five population groups distributed worldwide. These loci were primarily chosen as candidate genes for their possible involvement in cancer [Belanger et al., 2005], drug response, or inflammation, and/or for displaying allelic imbalance [Pastinen et al., 2004].
In this article, we present a detailed characterization of genetic diversity in upstream regions of a subset of 28 of these genes, and we describe the corresponding allelic and haplotypic frequencies as well as their geographic distributions. The resulting diversity patterns portray different landscapes of genomic variability, suggesting as well that natural selection could have influenced the evolution of some of these segments.

Polymorphisms and Genotyping
Polymorphisms were detected by dHPLC (WAVE System of Transgenomic; Transition Technologies Inc., Toronto, Ontario, Canada) and sequencing as described. Briefly, typically seven amplicons of ~300 bp were designed (Primer 3 Software [Rozen and Skaletsky, 2000]) to cover the 2-kb segment upstream of the transcription start site. DNA fragments were amplified by standard PCR, using 10 ng of genomic template, and the products were analyzed by dHPLC at a minimum of two different temperatures. The purported heteroduplexes were subsequently sequenced (Applied Biosystems 3730 xl DNA Analyzer; Applied Biosystems, Foster City, CA) to identify the underlying variants. Allelic states of substitution and small insertion/deletion polymorphisms in the human and in the great ape samples were determined by dynamic ASO hybridization as described [Bourgeois and Labuda, 2004], using the ASO probes listed in Supplementary Table S2. To avoid transcription errors, the genotypes were read and entered into our database twice by two independent individuals, and only concordant readings were accepted.
Tajima's D test statistic [Tajima, 1989] considers the difference between $\theta_S$ and $\theta_\pi$, normalized by the expected standard deviation of this difference. The first estimator, $\theta_S = S / \sum_{i=1}^{n-1} (1/i)$, where n corresponds to the number of sampled chromosomes [Watterson, 1975], is only influenced by the number of segregating sites, S. The second estimator, $\theta_\pi$, representing the mean number of pairwise differences between individual sequences [Tajima, 1989], corresponds to the product of nucleotide diversity and the sequence length L such that $\theta_\pi = \pi L$. The number of sites, $S_1$, i.e., the number of sites with a derived allele observed only once ($i = 1$), provides the estimator $\theta_F = S_1$ [Fu and Li, 1993]. $\theta_F$ is used in conjunction with either $\theta_S$ or $\theta_\pi$ to calculate the test statistics $D_{Fu\&Li}$ and F, respectively, as well as the related statistics $D^*$ and $F^*$ [Fu and Li, 1993]. The estimator $\theta_H$, by Fay and Wu, corresponds to the sum of twice the squared frequency of the derived alleles [Fay and Wu, 2000]. The corresponding H-statistics comparing $\theta_\pi$ with $\theta_H$ are particularly sensitive to selective sweeps and the effect of genetic hitchhiking [Maynard Smith and Haigh, 1974], whereby a positively selected variant simultaneously drives up the population frequency of its neighboring tightly linked alleles. The test statistic Fs [Fu, 1997] compares the number of observed haplotypes with that expected given $\theta_\pi$, thus confronting the latter with $\theta_k$. Tests by Ewens-Watterson and Chakraborty [Chakraborty, 1990; Ewens, 1972; Watterson, 1978] based on the infinite allele (haplotype) model can be considered to some extent complementary; they confront $\theta_k$, estimated from the number of haplotypes, with $\theta_{hom}$, estimated from the haplotype homozygosity, hom = 1-G. To evaluate the significance of the Ewens-Watterson test, Arlequin implements the protocol described by Watterson [1978] and that by Slatkin [1994].
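The estimators described above can be made concrete with a short sketch. The Python below is only an illustration, not the software used in this study (the text mentions Arlequin): it assumes haplotypes are given as equal-length strings of "0" (ancestral) and "1" (derived), and the function name `diversity_summary` is hypothetical. It computes Watterson's and Tajima's estimators, Fay and Wu's estimator, and Tajima's D with the standard variance constants from Tajima [1989].

```python
from math import sqrt

def diversity_summary(haplotypes):
    """Summary statistics from haplotypes coded '0' (ancestral) / '1' (derived)."""
    n = len(haplotypes)  # number of sampled chromosomes (must be >= 2)
    # Derived-allele count at every site; keep only segregating sites
    counts = [sum(h[j] == "1" for h in haplotypes) for j in range(len(haplotypes[0]))]
    counts = [c for c in counts if 0 < c < n]
    S = len(counts)

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    theta_s = S / a1                                                # Watterson
    theta_pi = sum(2.0 * c * (n - c) for c in counts) / (n * (n - 1))  # mean pairwise differences
    theta_h = sum(2.0 * c * c for c in counts) / (n * (n - 1))         # Fay and Wu

    # Variance constants of Tajima's D
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * S + e2 * S * (S - 1)
    tajima_d = (theta_pi - theta_s) / sqrt(variance) if variance > 0 else float("nan")

    return {"S": S, "theta_s": theta_s, "theta_pi": theta_pi,
            "theta_h": theta_h, "tajima_d": tajima_d}

# Toy sample: three singleton sites give an excess of rare variants, hence D < 0
stats = diversity_summary(["000", "001", "010", "100"])
```

In this toy sample every derived allele is a singleton, so the pairwise-difference estimator falls below Watterson's estimator and D is negative, the same direction of departure as reported for CDC25A.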
All neutrality tests were carried out using data obtained with an initially ascertained sample of 80 chromosomes, except in the Fay and Wu test, where an extended sample of 80 genotyped individuals was used. The Hudson-Kreitman-Aguadé (HKA) test [Hudson et al., 1987], which considers the number of segregating sites, S, as well as nucleotide diversity, $\pi$, as a measure of locus diversity, was executed as previously described [Jaruzelska et al., 1999]. The reported Fsts and the results of the HWE test were obtained using genotypes of the extended sample of 80 individuals. Correction for multiple testing (neutrality two-tailed and HWE tests) was carried out according to Storey [2002] using a false discovery rate of 10% (http://faculty.washington.edu/jstorey/qvalue/index.html).
Divergence, d, between human and chimpanzee sequences was estimated by counting differences between the human sequence (Supplementary Table S1) and the chimpanzee sequence (November 2003 Assembly UCSC Browser [http://genome.ucsc.edu/cgi-bin/hgGateway]), initially at the overlapping length with the analyzed 2-kb human segment and subsequently extended to about 12 kb by adding 5 kb on each side to decrease variance in the d estimate. Both estimates largely agreed, and the latter is reported (Table 1). Prism v. 4.03 (GraphPad Software Inc., San Diego, CA), Excel (Microsoft, Redmond, WA), and Statistica v. 7.1 (StatSoft Inc., Tulsa, OK) were used to evaluate distributions and in correlation analyses.
Coalescent simulations [Hudson, 1990] were performed under a selectively neutral model using the Cosi program of Schaffner et al. [2005] (www.broad.mit.edu/personal/sfs/cosi). We carried out these simulations using mutation rates defined by the average density of polymorphic sites and $N = 10{,}000$ to obtain distributions of: 1) the number of haplotypes carrying distinct numbers ($m = 0, 1, 2$, etc.) of new alleles; 2) the frequencies of the ancestral haplotypes; and 3) the major haplotypes among different classes of haplotypes defined by the number of new alleles, m, that they carry. In addition to simulations under the standard model without recombination, we studied the effect of recombinations, as well as demographic expansion and/or population bottleneck as used by Akey et al. [2004].
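To illustrate the kind of simulation described above, the following Python sketch draws one sample under the standard neutral coalescent with infinite-sites mutation, constant population size, and no recombination. It is not the Cosi program and does not model the recombination or demographic scenarios mentioned in the text; the function names and the direct parameterization by the population mutation parameter `theta` (4N times the per-segment mutation rate) are assumptions made for this example.

```python
import random
from collections import Counter

def simulate_sample(n, theta, rng):
    """Return, for each of n chromosomes, the set of derived mutations it carries."""
    lineages = [[i] for i in range(n)]   # sampled chromosomes below each ancestral lineage
    carried = [set() for _ in range(n)]
    next_mutation = 0
    while len(lineages) > 1:
        k = len(lineages)
        # Going back in time, the next event is a coalescence with
        # probability (k - 1) / (k - 1 + theta), otherwise a mutation.
        if rng.random() < (k - 1) / (k - 1 + theta):
            a, b = rng.sample(range(k), 2)
            merged = lineages[a] + lineages[b]
            lineages = [l for idx, l in enumerate(lineages) if idx not in (a, b)]
            lineages.append(merged)
        else:
            # A new mutation on a random lineage is inherited by all its descendants
            for chrom in rng.choice(lineages):
                carried[chrom].add(next_mutation)
            next_mutation += 1
    return carried

def allelic_class_counts(carried):
    """Number of chromosomes carrying m = 0, 1, 2, ... derived alleles."""
    return Counter(len(mutations) for mutations in carried)

def ancestral_frequency(carried):
    """Fraction of chromosomes carrying the ancestral haplotype (no derived allele)."""
    return sum(1 for mutations in carried if not mutations) / len(carried)

rng = random.Random(1)
sample = simulate_sample(80, 2.0, rng)
classes = allelic_class_counts(sample)
```

Repeating `simulate_sample` many times and averaging `allelic_class_counts` or `ancestral_frequency` would give neutral reference distributions of the type the simulations were used to obtain.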

Diversity Data
In a group of 28 protein-coding genes (Supplementary Table S1), we characterized genetic variations in their promoter regions, defined arbitrarily as 2-kb segments upstream of the transcription start sites [see Kim et al., 2005]. Transcriptional start sites were defined based on the mRNA sequence versions (REFSEQ) listed in Supplementary Table S1. DNA polymorphisms were first ascertained by dHPLC, combined with DNA sequencing, in a panel of 40 genomic samples representing individuals of sub-Saharan African, Native American, European, Middle Eastern, and Southeast/East Asian descent. All variants reported here were genotyped and thus independently reconfirmed by ASO hybridization in the same panel of 40 individuals. We observed 254 simple polymorphisms: 243 substitutions and 11 indels. The sequence contexts of each of the reported polymorphisms, as well as their "rs" identifiers, are provided in Supplementary Table S2, while the corresponding reconstructed haplotypes (see Materials and Methods) are listed in Supplementary Table S3. HWE was primarily tested as an additional means of checking the quality and consistency of our genotypes [e.g., Fan et al., 2002]. As a result, and after correcting for multiple testing [Storey, 2002] (see Materials and Methods), we found three polymorphisms in the CX3CR1 segment showing significant departure from the HWE (see below).
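As an illustration of the HWE check mentioned above, the sketch below computes a Pearson chi-square goodness-of-fit statistic (1 degree of freedom) for one biallelic site from its three genotype counts. This is a generic textbook calculation given as an assumption about how such a test can be run; it is not necessarily the exact procedure or software used here, and the function name is hypothetical.

```python
from math import erfc, sqrt

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one biallelic site."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2.0 * n)      # frequency of allele a
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic site)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper-tail probability of a chi-square variable with 1 degree of freedom
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value

chi2, p_value = hwe_chi_square(25, 50, 25)   # genotype counts in exact HWE proportions
```

With small samples such as 40 or 80 individuals, expected counts of the rare homozygote can be low, in which case an exact test is usually preferred over the chi-square approximation.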
Different diversity indices and other characteristics of the analyzed loci are presented in Table 1, which also includes per segment Fst values for the whole population sample and considering only non-African groups. Geographic distribution of haplotypes is given within haplotype networks shown in Supplementary Figure S1. On average, we observed nine segregating sites (S = 9.1 ± 4.6) and a similar number of haplotypes (k = 8.8 ± 4.2) per segment (average length of 1,854 ± 329 bp). Mean nucleotide diversity and haplotype diversity are $\pi$ (%) = 0.082 ± 0.056 and G = 0.58 ± 0.23, respectively. The associated standard deviations manifest a large variance between individual loci. The extent of this variance may be due to: 1) stochastic effects resulting from different genealogical histories of each of these segments; 2) the effect of selection; or 3) heterogeneity in the mutation rate among loci [Chuang and Li, 2004; Matassi et al., 1999]. However, we did not notice any systematic correlation ($r^2 = 0.015$) between S and nucleotide divergence d, measured as a proportion of fixed sites between humans and chimpanzees (Table 1). Thus, variation in the mutation rate alone cannot explain differences between the polymorphic content in the analyzed segments. It could possibly be due to differences in the underlying genealogies, reflecting either variation in demographic history or the effect of natural selection. Examination for these effects requires the analysis of diversity indices that capture different aspects of the data.
In Table 1, we compared different estimators of the population mutation parameter $\theta = 4N\mu$, where N denotes the effective population size and $\mu$ the mutation rate per segment per generation. Watterson's estimator, $\theta_S$ [Watterson, 1975], and Tajima's $\theta_\pi$ [Tajima, 1989], based on the infinite sites model, can be derived from S and $\pi$, respectively. In turn, $\theta_k$ [Ewens, 1972] and $\theta_{hom}$ [Chakraborty, 1990], estimated from k and G, originate in the infinite alleles model (see Materials and Methods). Note that the term "allele" is reserved here for variants of a single segregating site as in the infinite sites model and, in the infinite alleles model, should be replaced by "haplotype" (i.e., variant of the whole locus). Two additional estimators, $\theta_F$ by Fu and Li [1993] and $\theta_H$ by Fay and Wu [2000], require knowledge of the derived and the ancestral state at each of the segregating sites, here obtained by genotyping chimpanzee and other ape DNAs (see Materials and Methods). Knowing the ancestral allele at each polymorphic site, we introduced an additional descriptor of locus diversity, a haplotype allelic class describing the number of mutational steps separating each haplotype from the ancestral haplotype (i.e., the one entirely composed of ancestral alleles). Table 1 lists haplotype allelic class M for each of the major haplotypes. The major, i.e., most frequent, haplotype in 12 of these segments is the ancestral one (i.e., $M = 0$). In contrast, we observed $M \geq 5$ in five of these (see the correlation with the corresponding $\theta_H$ values). The four-gamete test indicated that recombinations contributed to haplotype diversity in 10 of the segments. Yet, as indicated in Table 1, in four of them the impact of these events was relatively small, with only one additional recombinant haplotype observed (see the corresponding haplotype networks in Supplementary Fig. S1).
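The four-gamete test mentioned above can be sketched in a few lines. Under the infinite sites model without recombination, two biallelic sites can show at most three of the four possible allele combinations, so observing all four points to recombination (or recurrent mutation) between them. The Python below assumes haplotypes coded as strings of "0" and "1"; the function name is hypothetical and the code is illustrative rather than the implementation used for Table 1.

```python
from itertools import combinations

def four_gamete_pairs(haplotypes):
    """Return the pairs of site indices at which all four gametes are observed."""
    n_sites = len(haplotypes[0])
    incompatible = []
    for a, b in combinations(range(n_sites), 2):
        gametes = {(h[a], h[b]) for h in haplotypes}   # distinct two-site combinations
        if len(gametes) == 4:
            incompatible.append((a, b))
    return incompatible

# All four combinations 00, 01, 10, 11 are present, so the pair (0, 1) is flagged
flagged = four_gamete_pairs(["00", "01", "10", "11"])
```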

Neutrality Tests
We carried out neutrality tests that confront different estimates of Y and/or summary statistics, such as the number of haplotypes or their homozygosity (Table 1). Assuming a simple model of a population at constant size, mutational equilibrium, and neutrality, these different estimates are expected to be the same or to agree, given the associated variance (see Materials and Methods). An opposite result indicates departure from this model, suggesting selection or less simple demography, i.e., effects due to population growth, bottleneck, or population structure. Table 2 lists all segments highlighted by these tests; the results that remain significant after correcting for multiple testing [Storey, 2002] using a false discovery rate of 10% are shown in bold.
The testing of data in the framework of the infinite sites model can be illustrated by a histogram of allelic frequency classes that regroup sites with the same number of copies of the derived allele, from $i = 1, 2, 3, \ldots$ to $i = n-1$, where n is the number of chromosomes in the sample. The expected distribution is $S_i = \theta / i$ [Fan et al., 2002; Fu, 1997], where $\sum_i S_i = S$, as illustrated in the left panels of Figure 1, where $\theta_\pi$ estimates (Table 1) were used to trace the theoretical curve according to the above equation. The corresponding plots for segments other than the three shown in Figure 1, either highlighted by neutrality tests or singled out by Fst statistics, can be found in Supplementary Figure S2. The histogram of allelic frequency classes in Figure 1 shows an excess of low-frequency polymorphisms in the case of CDC25A, as revealed by the negative Tajima's D in this segment (Table 2); it shows a good concordance between theoretical distribution and the data in the CX3CR1 segment and a marked excess of high-frequency derived alleles in the case of GSTM3. The latter agrees with the result of the Fay and Wu test for this segment (Table 2). Middle histograms in Figure 1 illustrate the results of the haplotype-based tests. In the case of CX3CR1, as for allelic frequency classes, this plot shows an excellent fit between the theoretical distribution and the observed frequencies. In these representations illustrating the results of the neutrality tests from Table 2, the CX3CR1 segment appears to conform to a simple neutral model. In contrast, in CDC25A the haplotype frequencies expected given the observed number of haplotypes do not match the data: there is an excess of observed haplotypes, given their homozygosity (1-G). This discordant distribution, in the case of the CDC25A segment, reflects significant results of haplotype-based tests, including Fu's Fs test, which, however, compares k with its estimate based on $\theta_\pi$ rather than $\theta_{hom}$.
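The comparison between observed allelic frequency classes and the neutral expectation of theta divided by i can be sketched as follows. The code assumes haplotypes coded as strings of "0" (ancestral) and "1" (derived); the function names are hypothetical, and the value of theta passed to `expected_sfs` is whatever estimator one chooses to plug in.

```python
def observed_sfs(haplotypes):
    """Counts of sites carrying i = 1 .. n-1 copies of the derived allele."""
    n = len(haplotypes)
    sfs = [0] * (n - 1)                      # sfs[i - 1] holds the count for class i
    for j in range(len(haplotypes[0])):
        i = sum(h[j] == "1" for h in haplotypes)
        if 0 < i < n:                        # ignore non-segregating sites
            sfs[i - 1] += 1
    return sfs

def expected_sfs(theta, n):
    """Neutral expectation theta / i for each class i = 1 .. n-1."""
    return [theta / i for i in range(1, n)]

observed = observed_sfs(["000", "001", "011", "111"])
expected = expected_sfs(6.0, 4)
```

An excess of observed counts in the low classes relative to `expected` corresponds to the pattern described for CDC25A, and an excess in the high classes to the pattern described for GSTM3.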
After correcting for multiple testing, no segment remained significant for either the Ewens-Watterson test or the Fay and Wu test. Furthermore, the significant results of Chakraborty's test for HTR2A and GPX2, as well as those of Fu's Fs test for GPX2, can likely be ascribed to the effect of recombinations. The latter, causing the number of observed haplotypes to increase faster than it would due to mutation alone, can render the results of the above tests falsely significant. Yet, at the same time, the presence of recombinations renders other tests, such as Tajima's or Fay and Wu's, less conservative, i.e., "more significant" [Fay and Wu, 2000]. Indeed, considering the effect of recombinations (three- to six-fold genomic average) in six segments where more than one recombinant haplotype was observed (Table 1), GPX3 stayed significant for the Fay and Wu test after the correction for multiple testing.

Haplotype Allelic Classes
While the left and middle histograms in Figure 1 represent the allelic and haplotype configurations of the analyzed segments, the histograms shown in the right column combine the allelic and haplotypic information in a single plot to reveal additional characteristics of the data. The haplotype allelic classes A represent all haplotypes carrying the same number of new alleles m such that, for example, $A_{m=0}$ represents the number of the ancestral haplotypes in the sample; $A_1$, the number of all haplotypes with one derived allele; $A_2$, with two derived alleles, etc., such that $\sum_i A_i = n$ (and not k), as opposed to $\sum_i S_i = S$. Note that we have already introduced this notion to describe the allelic class $m = M$ of the major haplotype (Table 1). In Figure 1, the theoretical distributions of the expected counts of haplotypes in each of their allelic classes m were computed using coalescent simulations under a simple population model. The plot of haplotype allelic classes in CDC25A summarizes well the characteristics of this segment revealed by the negative Tajima's D and Fu's Fs statistics as well as by Chakraborty's test. Due to the excess of haplotypes originating in polymorphisms with low counts of the derived alleles, the distribution of haplotype allelic classes is skewed to the left as compared to the simulated curve. CDC25A is representative of a group of loci sharing similar characteristics (Table 2), namely CDKN2A, MICA, SKP2, SMAD4, and, to a much lesser extent, HDAC1 (Supplementary Fig. S2).
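The definition of haplotype allelic classes translates directly into a count of derived alleles per sampled chromosome. The sketch below assumes each sampled chromosome is represented by a string of "0" (ancestral) and "1" (derived) and returns both the class occupancies and the class M of the most frequent haplotype; the function name is hypothetical, and ties between equally frequent haplotypes (as reported for CDKN1B) are resolved arbitrarily by `Counter.most_common`.

```python
from collections import Counter

def haplotype_allelic_classes(haplotypes):
    """Return (A, M): occupancy of each allelic class and the class of the major haplotype."""
    # A[m] = number of sampled chromosomes whose haplotype carries m derived alleles
    classes = Counter(h.count("1") for h in haplotypes)
    # Major haplotype = most frequent haplotype in the sample
    major_haplotype, _ = Counter(haplotypes).most_common(1)[0]
    return classes, major_haplotype.count("1")

sample = ["000"] * 5 + ["001"] * 2 + ["011"]
classes, major_class = haplotype_allelic_classes(sample)
```

Because every chromosome falls into exactly one class, the occupancies sum to the sample size n rather than to the number of distinct haplotypes k.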
On the other side of the spectrum are the segments with their major haplotypes removed by several mutations from the ancestral one, $M \geq 3$ (Table 1). As for CX3CR1 and GSTM3 in Figure 1, the observed distributions are this time skewed to the right relative to the simulated curve. As shown in Supplementary Figure S2, this group includes all segments already singled out by the Fay and Wu test and by the positive Fu and Li statistics (Table 2), i.e., BTN3A2, GPX3, GSTM3, GSTP1, and HTR2A, and in addition CX3CR1, which was not highlighted by any of these tests. Because recombinations were observed in some of these segments, we also simulated distributions of haplotype allelic classes assuming recombination at a rate 10 times the genomic average. It turned out that simulated distributions were relatively insensitive to crossovers, such that the observed distribution cannot be explained by the effect of recombinations at such intensity (dashed line in Fig. 1 for CX3CR1; see also Supplementary Fig. S2).

Population Variance: Fst
Over time, populations differentiate in allele frequencies, and the resulting geographic partitioning of this diversity can be measured by Fst [Wright, 1951]. We estimated fixation index (Fst) values [Weir and Cockerham, 1984] for each segment based on the contributing sites and for each polymorphic site separately. Over the segments, the average Fst was 6.6 ± 5.4% in the total sample composed of five population groups, and 4.3 ± 4.5% when four non-African population groups were considered (Table 1). In turn, when 254 polymorphisms were considered individually, the average Fst was 4.6 ± 5.8% for all populations and 3.1 ± 5.1% considering only four non-African groups. The individual Fst values can be evaluated by comparison to their empirical distributions obtained from larger data sets [Akey et al., 2002; Excoffier, 2005; Fullerton et al., 2002], and this approach was used to single out promoter polymorphisms that presumably evolved under natural selection [Rockman et al., 2003, 2004; Wang et al., 2005]. As a reference, we used our set of the 254 Fst values for polymorphisms investigated here (Supplementary Fig. S3). Several sites at the edge of the distribution exceeded the arbitrary threshold of the average Fst plus 2 SDs, either in the total five-population sample and/or in the non-African four-population sample (Supplementary Table S4). Among them, we find three sites in CX3CR1 (rs2669846:G>T, rs11715522:A>C, rs11917223:C>G) that are characterized by relatively elevated (30-50%) new allele frequencies. Interestingly, the same sites were not in Hardy-Weinberg equilibrium ($\chi^2 = 17.9$, $p < 4 \times 10^{-5}$; $\chi^2 = 15.3$, $p < 0.0001$; $\chi^2 = 8.3$, $p < 0.005$, in the world sample, and $\chi^2 = 11.5$, $p < 0.0009$; $\chi^2 = 10.6$, $p < 0.005$; $\chi^2 = 9.6$, $p < 0.005$, in the non-African sample, respectively). Note that no other sites, with low or high Fst, showed any departure from the HWE, suggesting that this represents a specific effect of the CX3CR1 locus.
Therefore, the Hardy-Weinberg disequilibrium together with high Fsts can be taken as evidence for evolutionary forces other than random drift acting upon this locus. On the other side of the Fst distribution, there are sites with zero or near-zero Fsts. If the data are informative, this might suggest the effect of evolutionary forces countering random drift in order to maintain allele frequencies at similar levels across populations (e.g., BTN3A2; Table 1). We note that the average per site Fst for all polymorphisms examined in this study was 4.6%, which is less than the corresponding values of 10 to 15% reported in the literature for a variety of genetic markers [Bowcock et al., 1991; Jorde et al., 2000; Akey et al., 2002]. This difference can be at least partly accounted for by the ascertainment bias and high average per site heterozygosity of the classical, early RFLP as well as Alu markers [Bowcock et al., 1991; Jorde et al., 2000]. Here, in contrast, all polymorphic sites are considered, including those of low minor allele frequency [Ronald and Akey, 2005]. However, a similarly obtained (resequencing) set of 297 polymorphisms from the expressed sequence tags [Fan et al., 2002] shows twice as high an average Fst when compared with our data (9.2 ± 12.7% for the total sample and 6.5 ± 10.4% when Africans were excluded). Yet the shape of the Fst distribution is virtually identical in both data sets (Supplementary Fig. S3), raising the question whether the low average Fst observed here reflects its overall depression upstream of the protein-coding genes or only represents an artifact of a particular configuration of our population samples.
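For readers unfamiliar with the fixation index, the sketch below computes a simple Wright-style Fst for one biallelic site as the proportional reduction of expected heterozygosity within populations relative to the pooled population. It is a deliberately simplified stand-in and not the Weir and Cockerham [1984] estimator used in this study: it weights populations equally and applies no sample-size correction. The function name is hypothetical.

```python
def wright_fst(allele_freqs):
    """Fst = (H_T - H_S) / H_T from per-population frequencies of one allele."""
    k = len(allele_freqs)
    p_bar = sum(allele_freqs) / k
    h_t = 2.0 * p_bar * (1.0 - p_bar)        # expected heterozygosity of the pooled population
    if h_t == 0.0:
        return 0.0                           # monomorphic site: Fst undefined, reported as 0 here
    h_s = sum(2.0 * p * (1.0 - p) for p in allele_freqs) / k   # mean within-population value
    return (h_t - h_s) / h_t

fst_fixed = wright_fst([0.0, 1.0])     # populations fixed for different alleles
fst_equal = wright_fst([0.3, 0.3])     # identical frequencies everywhere
```

Identical frequencies across populations give an Fst of zero, and alternative fixation gives one; the near-zero values discussed above for some sites lie at the first of these extremes.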

Haplotype Allelic Classes and the Prevalence of Major Haplotypes
In our data, we found two opposite variation patterns at the level of haplotypes (see haplotype networks in Supplementary Fig. S1 and the list of haplotypes in Supplementary Table S3). The first pattern is characterized by the dominant presence of the ancestral haplotype (all of whose alleles are ancestral), whereas the second includes loci where the ancestral haplotype is absent or is present only at residual frequencies. These two opposite diversity profiles can be contrasted by analyzing all segments together. In Figure 2A, we show the histogram of the ancestral haplotype frequencies in our sample of 28 segments, counting loci with no ancestral haplotype and the number of those with the ancestral haplotype falling within each of the four frequency quartiles. The frequency "0" and the first quartile [0-0.25] represent 15 loci in a category of segments lacking or with ancestral haplotypes at minor frequencies. The three other quartiles [0.25-1.0] include 12 loci where the ancestral haplotype is the major haplotype, plus the CDKN1B segment, where two haplotypes, the ancestral one and the one carrying one derived allele, happened to have equal frequencies ($M = 0/1$ in Table 1). Coalescent simulations were used to compare these data with theoretical distributions expected under a simple population model [Akey et al., 2004; Hudson, 1990; Schaffner et al., 2005]. We evaluated the expected distribution in 28 sequence segments, given the average density of polymorphic sites in this data set. Simulations were carried out under constant population size in the absence and in the presence of recombinations (10-fold the genomic average). It turns out that in the 28 segments analyzed we observe ancestral haplotypes more often than predicted under this model and that these ancestral haplotypes tend to occur at frequencies higher than expected. They prevail in the uppermost frequency quartile (Fig. 2A). The difference between the observed and the expected is even more acute in the presence of recombinations. On the other hand, to visibly affect the distribution, the recombinations would have to occur at a rate well above the genomic average.

FIGURE 1. Distributions of allelic frequency classes (left panels), of frequencies of haplotypes (middle), and of haplotype allelic classes (right) in CDC25A, CX3CR1, and GSTM3. Bars represent the observed values; lines represent theoretical distributions. The occupancy of allelic frequency classes corresponds to counts of sites represented by i new alleles in a sample of n chromosomes ($i = 1, 2, 3, \ldots, n-1$). Here, the theoretical curve (solid line) corresponds to the distribution calculated from the equation [Fan et al., 2002; Fu, 1997] $S_i = \theta / i$, using $\theta_\pi$ (Table 1) as the estimator of $\theta$. The theoretical distribution (solid line) of haplotype frequencies expected given k observed haplotypes (Table 1) is according to Ewens [1972]. Haplotype names are arbitrary and correspond to their names in our database. In the case of haplotype allelic classes, regrouping haplotypes sharing the same number of mutations from the ancestral haplotype, their theoretical occupancy was obtained by coalescent simulation under the standard model, assuming constant population size without (solid line) and with (dotted line) recombination, at 10-fold the genomic average in the case of segments where crossovers were detected.
In Figure 2B, the histogram of the partition of major haplotypes among haplotype allelic classes (i.e., M from Table 1) shows a U-like skewed distribution, as in Figure 2A. We observe an excess of the ancestral haplotypes (on the left) and of haplotypes carrying five or six new alleles, on the right of the distribution. The first effect becomes more pronounced in the presence of recombinations, while the second becomes attenuated but does not disappear in their presence. Our simulations also examined different demographic scenarios, such as those described by Akey et al. [2004], showing that neither population growth nor population bottlenecks substantially affected the simulated distributions with respect to the simple population model (Supplementary Fig. S4). In other words, among our loci there are more ancestral haplotypes that are major haplotypes than would be expected under the simple neutral model. At the same time, we observe a surplus of haplotypes being major and carrying excessive numbers of the derived alleles, consistent with the results of the statistical tests from Table 2. In summary, the analysis presented above suggests that in the investigated set of segments, certain loci appear as evolutionarily conserved while others seem to be more evolved relative to the average.

DISCUSSION
We have analyzed 28 genomic segments located upstream from the transcription start sites of the protein-coding genes, where transcription control elements usually reside. The overall pattern of the observed diversity, both qualitatively and quantitatively, did not differ from the genomic average of nonexonic DNA segments. Nucleotide diversity of 0.082% was similar to that observed by others in noncoding sequences [Livingston et al., 2004; Zhao et al., 2000]. Likewise, the rate of evolution estimated from the sequence difference with chimpanzees ($d = 1.28\%$; $\mu = 1.07 \times 10^{-9}$ per bp per year) represented the genomic average [Chen and Li, 2001], leading to an estimate of the effective population size of 9,600, a value typically obtained in studies of human DNA diversity. However, large variances in different diversity indices, in summary statistics, and in distinct estimates of the population mutation parameter $\theta$ (Table 1) suggested that the observed averages did not accurately reflect the extent of the investigated segments' diversities. This effect was particularly well captured by the analysis of the prevalence of the ancestral haplotypes and that of the major haplotypes among haplotype allelic classes (Fig. 2). In a large number of loci, the ancestral haplotype was a major haplotype, but at the same time, there were other, relatively numerous loci with $M \geq 5$ and with a rare or nonexistent ancestral haplotype. Together this led to a skewed, U-like distribution of the data plotted in Figure 2, showing departure from the neutral model based on its empirical evaluation by coalescent simulations. Therefore, it is tempting to propose that such a distribution is a product of the combined effects of purifying selection acting on some of the loci and of adaptive evolution on the other loci. Indeed, significant results of statistical tests, such as that of Tajima (negative D) or Fay and Wu (negative H), are consistent with these opposite selection effects.
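The arithmetic linking divergence, mutation rate, and effective population size can be retraced with a short sketch. The text does not state the human-chimpanzee divergence time or the generation time used; the values of 6 million years and 20 years per generation below are assumptions chosen only because they reproduce the reported figures, and the function names are hypothetical.

```python
def mutation_rate_per_year(divergence, split_time_years):
    """Per-site rate from pairwise divergence d = 2 * mu * T."""
    return divergence / (2.0 * split_time_years)

def effective_population_size(pi, mu_per_year, generation_years):
    """Solve pi = 4 * N * mu for N, with mu expressed per generation."""
    return pi / (4.0 * mu_per_year * generation_years)

# Assumed split time of 6 million years (not stated in the text)
mu = mutation_rate_per_year(0.0128, 6e6)                  # about 1.07e-9 per bp per year
# Assumed generation time of 20 years (not stated in the text)
n_e = effective_population_size(0.00082, 1.07e-9, 20)     # close to the reported 9,600
```

Under different assumed divergence or generation times the estimate scales proportionally, so the value is best read as an order-of-magnitude check rather than a precise figure.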
On the other hand, the incongruence in different diversity estimators revealed by neutrality tests can be also ascribed to the demographic history itself affecting gene genealogies even in the absence of selection. It is usually argued that selection acts upon specific loci, while demography is common to all genomic segments and thus should affect them in the same way [Akey et al., 2004;Reed et al., 2005]. As a result, for loci sampled from the same population, the associated variance, due to shared demography, is expected to be lower and the detection of selection easier. Nevertheless, one has to consider that natural selection likely acts on different loci at different time periods and that the resulting diversity patterns are also differently and randomly affected by their genealogical histories. Given all this, when postulating the effect of selection, the results of different tests and different descriptive statistics, as well as geographic distribution of the genetic variants, should be considered together, including data on functional testing, if available.
A set of loci plausibly contributing to the observed excess of ancestral haplotypes among major haplotypes includes CDC25A, CDKN2A, and SMAD4. These segments remained significant for negative Tajima's D (and negative $D^*$ of Fu and Li in the case of SMAD4) after correction for multiple testing. The interpretation of the haplotype-diversity-based tests, Fu's Fs and Chakraborty's statistics (Table 2), is more complicated. While corroborating Tajima's test in the case of the segments above, in the case of GPX2 the significant results can also be ascribed to the effect of recombinations and/or to population amalgamation [Chakraborty, 1990]. Population amalgamation can also be invoked in the equally diverse MICA and HDAC1, where the effect of recombination can be neglected. In other words, a plausible genetic (recombination) or demographic (population structure) explanation for the observed diversity patterns can be proposed in these loci even in the absence of selection, despite the fact that these data originate from the same population sample as those from other loci analyzed here. On the other side of the diversity spectra (Table 1; Figs. 1 and 2) are the segments with elevated $\theta_H$ and M. They include BTN3A2, GPX3, GSTM3, and HTR2A, which were initially singled out by the Fay and Wu test; GSTP1, which remained significant for Fu and Li statistics after correction (Table 2); and CX3CR1. The latter was singled out by the departure from HWE and Fst values of its three segregating sites as well as by its distribution in the plot of haplotype allelic classes. Moreover, with $S = 16$, $\theta_\pi = 4.03$ (i.e., $\pi = 0.20\%$, 2 SD above the average), $k = 17$, $G = 0.86$, and $\theta_k = 6.032$, CX3CR1 was the most diversified segment among those analyzed in this study and the only one that turned out significant in the HKA test ($p < 0.025$; in which we compared each of the segments against the collection of 28 segments analyzed here).
In these six loci, we observed an important skew toward high haplotype allelic classes (Fig. 1; Supplementary Fig. S2) compared to the neutral expectation under a simple population model. Consequently, all of them but GSTM3 (M = 3) also contributed to a rightward skew of the data plot in Figure 2 that could not be explained by the demographic scenarios proposed by Akey et al. [2004] and only to some extent by recombination, and then only at well above the average genomic rate.
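As an illustration of how the per-segment and per-nucleotide diversity values quoted for CX3CR1 relate to each other, the following minimal sketch divides Θπ by the segment length. The function name `per_site_diversity` is hypothetical, and the length of 2,000 bp is an assumption based on the operational 2-kb definition of the promoter segment; the value 4.03 is taken from the text.

```python
def per_site_diversity(theta_pi: float, segment_length_bp: int) -> float:
    """Convert a per-segment pairwise diversity into a per-nucleotide value."""
    if segment_length_bp <= 0:
        raise ValueError("segment length must be positive")
    return theta_pi / segment_length_bp


# CX3CR1: per-segment value reported in the text; 2,000 bp is an assumed length.
cx3cr1_pi = per_site_diversity(4.03, 2000)
print(f"{cx3cr1_pi:.2%}")  # prints 0.20%
```

Under this assumption the result agrees with the per-nucleotide diversity of 0.20% given for this segment.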
The question is to what extent our findings concerning 5′ flanking regions are particular to these segments and to what extent they are representative of other noncoding 2-kb sequences. One can also argue that our sample is biased because of the particular set of genes we examined and is therefore not representative of other 5′ flanking sequences. But this is almost admitting that these loci are special, indirectly reinforcing a selectionist interpretation, whereby purifying and adaptive selection created opposite patterns of diversity at different loci. In any case, our results provide a useful reference for future comparative analyses that will eventually show to what extent the observed variance in genetic diversity among sequence segments reflects the genomic reality and what part is attributable to selection, to stochastic effects, and to complex demographic histories. On the other hand, additional data will be required to dissociate the effects particular to the examined region from the influence of the linked, adjacent sequences. Indeed, there are numerous reports describing promoter regions as containing variant sites affecting their function and as containing variants associated with a disease or representing likely targets of selection. In this context, it is interesting to note that the polymorphic site rs769214 in the CAT promoter (see Supplementary Table S2) was reported to be associated with different blood pressure levels (originally SNP844 in Jiang et al. [2001]). In turn, the new allele of the rs36228834 site in CDKN2A and the ancestral allele of the rs36228499 polymorphism in CDKN1B were found to be associated with an increased risk of childhood acute lymphoblastic leukemia [Healy et al., 2006].
At another site in CDKN1B, the new T allele of rs3759217, with relatively high Fst in non-African populations, abolishes the myoblast-determining-factor binding site [Lassar et al., 1989], i.e., CANCtg-TANCtg (see TRANSFAC; www.gene-regulation.com/pub/databases.html), although the relevance of this mutation will have to be confirmed experimentally. In BTN3A2, with one dominant (68%) haplotype (Supplementary Figs. S1 and S2) carrying five derived alleles and a conspicuously zero Fst (Table 1), a selective sweep preceding the expansion of human populations could be suggested. Here, an adaptive change could have been associated with an increased transcription rate. In allelic-imbalance experiments [Pastinen et al., 2004], the expression of the ancestral haplotype was previously shown to be relatively suppressed. The same effect was demonstrated independently by in vitro transcription (N'Diaye, Pastinen, Paterson, Larivière, Labuda, Hudson, and Sinnett, unpublished data) of cloned constructs of the 2-kb BTN3A2 haplotypes from the present study. Similarly, in GSTM3, the functional analysis of its rs1332018G>T polymorphism (originally −63 A/C in Liu et al. [2005]) has shown eight-fold lower transcription activity of the G allele, related to its nine-fold reduced RNA Pol II binding capacity [Liu et al., 2005]. This ancestral G is absent in Africans but occurs at high frequencies outside Africa. In turn, in HTR2A and TGFB1, strong purifying selection on their coding segments was reported by Bustamante et al. [2005], but it appears not to be reflected in the diversity profiles of their 5′ flanking sequences (Tables 1 and 2). In contrast, in IL1A, the effect of local adaptation postulated from the analysis of the whole gene by Akey et al. [2004] seems to be reflected by the presence of three equally prevalent haplotypes of this segment, consistent with the scenario of balancing selection.
In fact, a newer version of its mRNA (REFSEQ NM_000575.3) shifts the transcription start site by 1.6 kb, so that all its polymorphisms are now found within the 5′ portion of the gene and none within the 0.4-kb upstream sequence analyzed here. We also note a possible shift in the promoter region of the CX3CR1 locus (NM_001337.3), with its new first exon located about 13 kb upstream with respect to the NM_001337.1 mRNA version we used. In contrast, the transcription start sites of the remaining genes either stayed the same or changed the position of the analyzed 2-kb segment by less than 0.2 kb.
In their recent work, Di Rienzo and Hudson [2005] proposed an evolutionary framework for common diseases, listing numerous examples where the ancestral allele represented a susceptibility variant. As shown above, such susceptibility alleles could be found in the promoter regions (CDKN1B) or represent plausible susceptibility candidates, conferring different expression activities and/or differing in geographic occurrence (e.g., BTN3A2 and GSTM3). If adaptive evolution is preferentially regulatory [King and Wilson, 1975] and still prevalent in human populations, more regulatory variants could be expected in the promoter than in the coding segments. Our study provides new evidence suggesting a role of selection, including adaptive changes, in the evolution of these segments. However, because molecular signatures of selection are relatively weak and our segments are short, selectionist interpretations have to be considered with caution. Nevertheless, recent findings point to the evolutionary importance of the cis-acting regulatory elements [Carroll, 2005; Rockman et al., 2005], and our results add to the increasing evidence suggesting that positive selection may be more pervasive in the human population than previously thought [Nielsen, 2005].