Dr. Peristera Paschou, Department of Molecular Biology and Genetics, Democritus University of Thrace, Panepistimioupoli, Dragana, Alexandroupoli 68100, Greece. Tel: +30 25510 30658; Fax: +30 25510 30613; E-mail: email@example.com
Studies of the genomic structure of the Greek population and Southeastern Europe are limited, despite the central position of the area as a gateway for human migrations into Europe. HapMap has provided a unique tool for the analysis of human genetic variation. Europe is represented by the CEU (Northwestern Europe) and the TSI populations (Tuscan Italians from Southern Europe), which serve as reference for the design of genetic association studies. Furthermore, genetic association findings are often transferred to unstudied populations. Although initial studies support the fact that the CEU can, in general, be used as reference for the selection of tagging SNPs in European populations, this has not been extensively studied across Europe. We set out to explore the genomic structure of the Greek population (56 individuals) and compare it to the HapMap TSI and CEU populations. We studied 1112 SNPs (27 regions, 13 chromosomes). Although the HapMap European populations are, in general, a good reference for the Greek population, regions of population differentiation do exist and results should not be light-heartedly generalized. We conclude that, perhaps due to the individual evolutionary history of each genomic region, geographic proximity is not always a perfect guide for selecting a reference population for an unstudied population.
The genomic structure of the Greek population has not been studied to date, despite its central position in influencing genomic variation throughout Europe. Greece has served as a gateway to migrations from Anatolia, as well as a key refuge of Northern European populations retreating to the South during the last glacial maximum (Semino et al., 2000; Di Giacomo et al., 2003; Semino et al., 2004; Novelletto, 2007). It is in Greece that the earliest signs of Neolithic farmers in Europe are found, about 7000 cal BC, with the founding of a fully fledged farming community at Knossos on the Island of Crete, followed slightly later in the Northwestern Peloponnese of mainland Greece (Efstratiou, 2005; Perlès, 2005). These populations were undoubtedly crucial to expanding farming to the rest of Europe (Di Giacomo et al., 2004; King et al., 2008). Furthermore, during the period of the Magna Graecia, the sea served rather as a bridge among populations than as a barrier, with Greek traders forming settlements throughout the coasts of Italy, France, and Spain (King et al., 2011).
The patterns of genetic variation across different populations, shaped by history, environment, and stochastic processes, have long been studied in order to infer population relationships and uncover the origins of the human species (Cavalli-Sforza et al., 1994; Tishkoff & Kidd, 2004). Early in the 21st century, the premise of genome-wide association studies (GWAS) was built upon the notion that common variation in the human genome could be tagged and interrogated by a small number of carefully selected single nucleotide polymorphisms (SNPs; the so-called tagging SNPs or tSNPs for short) (Daly et al., 2001; Johnson et al., 2001). This same notion motivated the HapMap project, aiming to characterize the linkage disequilibrium (LD) structure of the human genome (International HapMap Consortium, 2003, 2005; 2007). The HapMap project has become an incredibly valuable resource for investigators around the world, guiding their studies of the genetic background of human disease.
At the same time, the genetic structure of Europe has proven to be considerably more complex than was appreciated until recently, and this complexity could also affect the use of HapMap reference samples for the study of European populations. Two major axes of variation are observed within Europe, namely from North to South and from East to West (Lao et al., 2008; Novembre et al., 2008; Paschou et al., 2008; Drineas et al., 2010). The HapMap phase 1 project included only one European population, the CEPH Europeans, who were actually collected in Utah, USA, and have been shown to have northwestern European ancestry (International HapMap Consortium, 2003). Since the North-to-South cline of variation was discovered, a second population (Italians from the region of Tuscany) was selected, presumably as representative of Southern European descent. It is worth mentioning that the Tuscan population, particularly during the Bronze Age and the Apennine Culture, had extensive trading relationships with the Minoan and Mycenaean civilizations of Greece (Barker & Rasmussen, 2000).
Here, we present for the first time an extensive study of the genetic structure of the Greek population, in comparison to the HapMap reference European populations of Northern Europe (CEPH Europeans—CEU) and Italy (Tuscan Italians—TSI). We study a total of 1112 SNPs spread across 27 regions of the genome. We show that the HapMap reference populations should be expected to serve as a good reference of genetic structure in the Greek population if detailed analysis per region is not required. Regions of population differentiation do exist and results cannot be easily generalized for the entire genome. Furthermore, our results indicate that, perhaps due to complex population relationships and environmental pressures, geographic proximity is not always a perfect guide for selecting a reference population for an unstudied population.
Samples and Genotypes
We studied samples from 56 unrelated Greeks collected in Alexandroupoli, a city that lies in the northeastern corner of Greece. Participating volunteers were students of the Democritus University of Thrace (originating from many different regions of Greece), or healthy blood donors from the local University Hospital. Informed consent was taken from every participating individual. Self-reported ancestry was considered Greek if the individual reported all four grandparents to be of Greek ancestry and to have been born in Greece. DNA was extracted from whole blood using the Qiagen Puregene kit (Qiagen, Valencia, CA, USA).
Genotyping for 1813 SNPs across 27 different chromosomal regions was performed using an Illumina genotyping custom chip (Illumina, San Diego, CA, USA). The 27 regions across 13 chromosomes represent genomic regions that have been studied extensively at the laboratory of Drs. Kenneth and Judith Kidd at Yale University over the past few decades, in studies of human population structure around the world (Table 1, Supporting Information Table S1, and online supplement at http://www.cs.rpi.edu/~drinep/GREEKS/ ). We should note that these regions were a priori selected to include genes (Table S1), so we expect our results to be most informative of genic regions throughout the genome. The SNPs were selected to be informative (i.e. nonmonomorphic) and selection was based on distance (i.e. an attempt was made to create maps of markers at equal intermarker distance in each region). We should also note that the genotyped SNPs were explicitly chosen to be polymorphic, so our study is focused on common SNPs. For instance, only 9 and 15.6% of the studied SNPs in the CEU have a rare allele frequency below 10 and 15%, respectively (see online supplement for details on all studied SNPs in all three populations). Among the studied regions, there was one exceptionally large region on Chromosome 17 (about 4.6 Mb and 320 genotyped SNPs at an average intermarker distance of 14.3 kb), whereas for the rest of the regions, the average size was about 420 kb, covered on average by 35 SNPs at an average intermarker distance of 13.1 kb (Table 1).
Table 1. A detailed description of the 27 chromosomal regions (a list of all SNPs is available in the online supplement). The last column is an indication of the “normality” of the region in terms of outlier SNPs, that is, SNPs with abnormally high or abnormally low PCA scores (see Methods section for details). Note that the vast majority of the regions fall between 40 and 60%, a strong indication that the selected regions have normal behavior.
No. of SNPs | Avg. intermarker distance | Avg. PCA score
Of the 1813 SNPs that were genotyped for our Greek population, we were able to find 1112 SNPs genotyped in both the HapMap CEU (112 individuals) and TSI (88 individuals). Only unrelated HapMap individuals were retained. We extracted the relevant data from the HapMap phase 3 database and produced a joint data set of 1112 SNPs and 256 individuals from three European populations that became the focus of our analysis.
Analysis of LD and Population Structure
We visualized the data via principal components analysis (PCA), a well-known dimensionality reduction technique. In prior work (Paschou et al., 2007b; Paschou et al., 2008), we have extensively described how to encode and mean-center genotypic data in order to apply PCA. In our setting, PCA represents all samples with respect to the top two principal components (eigenSNPs). Our choice of two eigenSNPs stems from extensive prior work on the analysis of European genotypic data (Paschou et al., 2008; Drineas et al., 2010).
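As a minimal sketch of this step, the following Python snippet mean-centers a genotype matrix and projects the samples onto the top two principal components through a singular value decomposition. The 0/1/2 allele-count encoding, the function name `top_principal_components`, and the toy matrix are assumptions made for illustration only; the encoding actually used is the one described in the cited prior work.

```python
import numpy as np

def top_principal_components(genotypes, k=2):
    """Return sample coordinates on the top-k principal components.

    genotypes: (n_samples, n_snps) array of allele counts coded 0/1/2
    (an assumed encoding for this sketch).
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)                      # mean-center each SNP (column)
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * S[:k]                     # left singular vectors scaled by singular values

# Toy data: two groups of three individuals typed at four SNPs (hypothetical values).
toy = [
    [2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1],   # group A
    [0, 0, 2, 2], [0, 0, 2, 1], [0, 1, 2, 2],   # group B
]
coords = top_principal_components(toy, k=2)
print(coords)
```

In this toy example the first component separates the two groups, which is the kind of structure a two-dimensional PCA plot is meant to reveal.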
In order to characterize the profile of the studied regions in terms of population differentiation, we used data from the POPRES (population reference) sample (Nelson et al., 2008). The subset of the POPRES data set that we analyzed comprises 1200 individuals from 11 European populations and has been described in detail previously (Novembre et al., 2008). For each SNP, we computed its correlation with the top two principal components of the data set (PCA-scores), which have been shown to capture the most significant axes of genetic variation within Europe (Lao et al., 2008; Novembre et al., 2008). PCA-scores were computed as we have previously described (Paschou et al., 2007b; Paschou et al., 2008), and they were compared to the distribution of PCA-scores for all available SNPs in the POPRES data set (447,212 SNPs).
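The correlation of each SNP with the top two principal components can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure of the cited work: the 0/1/2 encoding, the function name `snp_pc_correlations`, and in particular the final step that combines the two correlations into one number (sum of squares) are choices made here for the example.

```python
import numpy as np

def snp_pc_correlations(genotypes, k=2):
    """Pearson correlation of every SNP with each of the top-k principal components."""
    X = np.asarray(genotypes, dtype=float)
    Xc = X - X.mean(axis=0)                     # mean-center each SNP (column)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    pcs = U[:, :k]                              # unit-length; zero-mean when the singular value is nonzero
    norms = np.linalg.norm(Xc, axis=0)
    norms[norms == 0] = np.inf                  # monomorphic SNPs get correlation 0
    return (Xc.T @ pcs) / norms[:, None]        # shape (n_snps, k)

# Toy genotypes (hypothetical): SNP 0 separates the two groups perfectly.
toy = [
    [2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1],
    [0, 0, 2, 2], [0, 0, 2, 1], [0, 1, 2, 2],
]
corr = snp_pc_correlations(toy, k=2)
score = (corr ** 2).sum(axis=1)                 # assumed way of combining the two correlations
print(corr)
print(score)
```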
Pairwise linkage disequilibrium tests, as well as tagging SNP (tSNP) selection and haplotype block definition, were performed using the algorithms implemented in Haploview (Barrett et al., 2005). For tSNP selection, the Tagger algorithm (as integrated in Haploview) was used without the multimarker testing option. The Gabriel et al. (2002) definition was applied here in order to define “haplotype blocks” across the studied regions. Haplot was used in order to visualize block boundaries across the three studied populations (Gu et al., 2005). We also measured the percentage of “block overlap” (see Supplementary Methods for details) between pairs of populations, in order to evaluate the similarity of haplotypic blocks between the Greek samples and the TSI samples (and vice versa), as well as the Greek samples and the CEU samples (and vice versa).
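For readers unfamiliar with the pairwise LD measure r² on which tagging relies, a minimal sketch is given below. It assumes phased, biallelic haplotypes coded 0/1 and computes r² directly from haplotype frequencies; it is not the Haploview implementation, which estimates haplotype frequencies from unphased genotypes. The function name `r_squared` and the toy haplotypes are hypothetical.

```python
def r_squared(haplotypes):
    """r^2 between two biallelic SNPs.

    haplotypes: list of (a, b) tuples, one per phased chromosome,
    with alleles coded 0/1 at the first and second SNP.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n                    # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n                    # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:                                             # one SNP is monomorphic
        return 0.0
    d = p_ab - p_a * p_b                                       # disequilibrium coefficient D
    return d * d / denom

perfect_ld = [(1, 1), (1, 1), (0, 0), (0, 0)]
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(r_squared(perfect_ld), r_squared(no_ld))
```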
Allele Frequencies and Population Differentiation over Studied Regions
A total of 1112 SNPs from 27 chromosomal regions were included in our analysis. The PCA of analyzed genotypes for the three populations is shown in Figure S1. In order to characterize the profile of allele frequencies over the studied regions in comparison to the entire genome, we used the POPRES genome-wide data set of Europeans and compared the PCA-scores of SNPs across the studied regions to that of the remaining genome (Table 1). In order to compute the PCA-scores, we effectively calculated the correlation of each SNP with the top two principal components of the data set; these components have been previously shown to correlate with ancestry across Europe (Lao et al., 2008; Novembre et al., 2008; Drineas et al., 2010). For each region we studied here, we estimated the average PCA-score of all available SNPs within the region. Table 1 shows the percentage of SNPs in the genome with a higher PCA-score than the average PCA-score in a particular region. As we have analyzed in detail in earlier work, a high PCA-score is expected for SNPs that show high association with population ancestry.
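The percentage reported per region can be illustrated with a short sketch: average the PCA-scores within a region, then count the share of genome-wide SNPs whose score exceeds that average. The function name and the toy scores below are hypothetical.

```python
def percent_genome_above_region(region_scores, genome_scores):
    """Percentage of genome-wide SNPs with a PCA-score above the region's average score."""
    region_mean = sum(region_scores) / len(region_scores)
    n_higher = sum(1 for s in genome_scores if s > region_mean)
    return 100.0 * n_higher / len(genome_scores)

# Toy values only; real PCA-scores would come from the genome-wide data set.
genome = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
region = [0.25, 0.75]
pct = percent_genome_above_region(region, genome)
print(pct)
```

Under this reading, a value near 50% marks an "average" region, while a low value marks a region whose SNPs are more strongly correlated with ancestry than most of the genome.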
Our results show that most regions that are included in our analysis are “average” in terms of allele frequencies and correlation to ancestry. Thus, our findings here can be considered, at least to some extent, representative for most regions of the genome as well, when it comes to the level of population differentiation. However, we would like to briefly comment on the top five population-differentiating regions included in our analysis (less than 40% of genome-wide SNPs show a higher PCA-score than the average PCA-score of the respective region). The first is the region of 17q21 in Chromosome 17, encompassing the MAPT gene and the recently identified inversion haplotype H2 (Stefansson et al., 2005). As we have recently shown, this 17q21 inversion, often thought to be found at levels of ∼20% throughout Europe, actually shows a great range of frequencies within Europe (ranging from 5% up to 37.5%) (Donnelly et al., 2010). The inverted H2 haplotype is actually most frequent around the Mediterranean and decreases outward in all directions. The second region spans the SLC44A5 and ACADM genes and has also been previously shown to account for high population differentiation as a possible candidate region for recent positive selection (Voight et al., 2006; Zhong et al., 2010). The third region spans the LCT gene, well known for its involvement in population differentiation across Europe (Bersaglieri et al., 2004; Campbell et al., 2005). The fourth region encompasses the CCR5 and neighboring genes. CCR5 is the coreceptor that HIV most commonly uses to enter target cells, and, in fact, specific variants of this gene have been associated with protection from HIV infection (de Silva & Stumpf, 2004). This region is well known to show population differentiation and has also been implicated as a candidate locus for natural selection (Novembre et al., 2005; Sabeti et al., 2005; Edo-Matas et al., 2011). 
Finally, the fifth region with average PCA-scores above the 40th percentile includes the CA10 gene. Variation across this particular gene has not been previously suggested as population differentiating, even though it lies approximately 6 Mb away from the aforementioned MAPT region.
Allele frequencies across studied SNPs for the three European populations are highly correlated, as shown in the scatter plots of Figures 1(A) and (B). As expected, higher correlation is observed between the Greek and the Tuscan population with a Pearson correlation coefficient r= 0.9643. The respective value for the correlation between the Greek and the CEU population is r= 0.9387. In order to identify the most population differentiating SNPs in our data set, we calculated the Informativeness (In) of each studied SNP, as defined by Rosenberg et al. (2003) (Fig. S2). Outliers were defined as SNPs whose In value exceeds the mean plus three standard deviations. Two SNPs were thus identified as outliers between the Greeks and the Tuscans (residing in the CD4 and ADH4 regions, respectively) and 23 SNPs were identified as outliers between the Greeks and the CEPH Europeans. Of the latter 23 SNPs, the top one resides in the CD4 region and the remaining 22 are found across the LCT region.
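The informativeness measure can be sketched for a biallelic SNP as below, following the published definition of Rosenberg et al. (2003): for each allele, the entropy term of the average frequency across populations is combined with the average of the per-population terms. Natural logarithms, the function name, and the toy frequencies are assumptions of this sketch.

```python
import math

def informativeness(freqs):
    """I_n for a biallelic SNP.

    freqs: frequency of one allele in each of K populations.
    """
    k = len(freqs)

    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0    # convention: 0 * log(0) = 0

    total = 0.0
    for allele_freqs in (freqs, [1.0 - p for p in freqs]):
        p_bar = sum(allele_freqs) / k               # average frequency across populations
        total += -xlogx(p_bar) + sum(xlogx(p) for p in allele_freqs) / k
    return total

same = informativeness([0.3, 0.3])      # identical frequencies carry no information
fixed = informativeness([1.0, 0.0])     # a fixed difference gives the maximum, log(K)
print(same, fixed)
```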
LD Structure in Greeks Compared to the HapMap 3 European Populations
We measured the extent of LD between all SNP pairs at a distance of 50 kb maximum in the Greek population and compared it to the respective estimates in the CEU and TSI populations. As shown in the scatter plots of Figures 1(C) and (D), most SNP pairs show a high degree of correlation between Greeks and HapMap European populations, with the average correlation coefficient being 0.9622 for comparisons between Greeks and TSI and 0.9614 for comparisons between Greeks and CEU. However, even though the average is high, isolated regions of lower correlation should not be overlooked. An SNP pair was defined to be an outlier if its residual fit to the diagonal (perfect correlation) exceeded the average residual plus three standard deviations. A total of 83 such discordant pairs were found in the Greeks to TSI comparison, whereas 104 such pairs were found in the Greeks to CEU comparison (see Supporting Information and online supplement at http://www.cs.rpi.edu/~drinep/GREEKS/ for considerations on the characteristics of the outlier pairs as well as a list of those pairs).
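The outlier rule for SNP pairs can be sketched as follows. The residual is taken here as the absolute difference between the two LD values of a pair, which is proportional to the perpendicular distance from the diagonal, so the mean-plus-three-standard-deviations rule flags the same pairs either way. Using the sample standard deviation, the function name, and the toy values are assumptions of this illustration.

```python
from statistics import mean, stdev

def discordant_pairs(ld_pop_a, ld_pop_b, n_sd=3.0):
    """Indices of SNP pairs whose LD differs unusually between two populations."""
    residuals = [abs(a - b) for a, b in zip(ld_pop_a, ld_pop_b)]
    threshold = mean(residuals) + n_sd * stdev(residuals)
    return [i for i, r in enumerate(residuals) if r > threshold]

# Toy data: 30 concordant pairs and one strongly discordant pair (hypothetical values).
ld_a = [0.5] * 30 + [0.9]
ld_b = [0.5] * 30 + [0.1]
outliers = discordant_pairs(ld_a, ld_b)
print(outliers)
```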
In order to further study the LD structure of the regions in the three European populations, we defined haplotype blocks in each of the studied regions using the criteria proposed by Gabriel et al. (2002) as implemented in Haploview. Results of this analysis are shown in Figure S3. A total of 143 blocks over the studied regions were found in the Greek population, whereas 169 and 176 blocks were found in the TSI and CEU populations, respectively. The average block size was 34.2 kb in Greeks, 31 kb in the TSI, and 34.2 kb in the CEU.
In an effort to quantify the degree of similarity between the haplotype blocks and LD structure in the Greek population compared to the HapMap European populations, we estimated the overlap of blocks defined in the Greek population and those defined in each of the TSI and CEU populations. The (average) block overlap values from the GRE samples to the CEU samples (and vice versa) and the (average) block overlap values from the GRE samples to the TSI samples (and vice versa) are shown in Table 2 for each of the 27 studied regions. The overall average overlap values are very similar: block overlap from GRE to CEU is (on average) 83%; block overlap from CEU to GRE is (on average) 65%; block overlap from GRE to TSI is (on average) 76%; and, block overlap from TSI to GRE is (on average) 73%. The average F1 statistics for the two pairs of populations are essentially the same: 71% for the GRE and CEU pair, and 72% for the GRE and TSI pair.
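As a rough illustration of how two directional overlap values can be summarized, the sketch below assumes that the F1 statistic is the usual harmonic mean of the two directions; the exact definition used in this study is not reproduced here. Note that the averages quoted above are averages of per-region values, so applying the formula to the overall averages need not reproduce them.

```python
def f1(overlap_ab, overlap_ba):
    """Harmonic mean of two directional overlap values (assumed definition of F1)."""
    if overlap_ab + overlap_ba == 0:
        return 0.0
    return 2 * overlap_ab * overlap_ba / (overlap_ab + overlap_ba)

example = f1(0.83, 0.65)    # the overall GRE-to-CEU and CEU-to-GRE averages
print(round(example, 3))
```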
Table 2. Average block overlap per region for both pairs of populations (GRE and CEU and vice versa, as well as GRE and TSI and vice versa). The F1 statistic (see Methods section for details) summarizes the two measurements for each pair of populations. The correlation coefficient between the GRE-to-CEU block overlaps and their reciprocals is equal to 0.73; the correlation coefficient between the GRE-to-TSI block overlaps and their reciprocals is equal to 0.57.
GRE to CEU | CEU to GRE | F1 (CEU, GRE) | GRE to TSI | TSI to GRE | F1 (TSI, GRE)
Selecting tSNPs in the Greek versus the HapMap European Populations
Next, we investigated the degree to which the HapMap European populations could serve as good reference samples for the selection of tSNPs in the Greek population. In order to do so, we selected tSNPs in each of the HapMap European reference samples (CEU and TSI) and proceeded to test the coverage and efficiency achieved by the selected tSNPs in the Greek population in all studied regions. Results were compared to coverage and efficiency of tSNPs selected from the Greek sample and are shown in Figures 2 and 3. Coverage is defined as the percentage of “untyped” SNPs in the studied region with r² > 0.8 with a tSNP in the Greek population.
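The coverage definition can be sketched as below: given pairwise r² values measured in the target population and a set of tSNPs chosen in a reference population, count the untyped SNPs that have r² above 0.8 with at least one tSNP. The matrix representation, the function name, and the toy values are assumptions of this sketch.

```python
def coverage(r2_matrix, tag_indices, threshold=0.8):
    """Percentage of untyped SNPs captured by at least one tag SNP.

    r2_matrix: symmetric list of lists of pairwise r^2 in the target population.
    tag_indices: indices of the SNPs selected as tSNPs (e.g. in a reference population).
    """
    tags = set(tag_indices)
    untyped = [i for i in range(len(r2_matrix)) if i not in tags]
    if not untyped:
        return 100.0
    captured = sum(
        1 for i in untyped
        if any(r2_matrix[i][t] > threshold for t in tags)
    )
    return 100.0 * captured / len(untyped)

# Toy r^2 matrix for four SNPs (hypothetical values).
r2 = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.85],
    [0.1, 0.1, 0.85, 1.0],
]
print(coverage(r2, [0]), coverage(r2, [0, 2]))
```

Efficiency, in the sense used here, is simply the number of tSNPs needed, that is, `len(tag_indices)` in this sketch.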
Overall, both the TSI and CEU, when used as reference for tSNP selection, achieve very good coverage of variation in the Greek population (Fig. 3). The TSI tSNPs capture, on average, 94.7% of “untyped” SNPs in Greeks, whereas the CEU capture a somewhat smaller percentage at 92.4%. For about half (13) of the studied regions, both TSI and CEU achieve exactly the same coverage, with exactly the same efficiency (number of selected tSNPs) in nine of these 13 regions. Interestingly, for eight of the regions we studied (regions 1, 2, 4, 6, 8, 9, 18, and 22) both the TSI and CEU achieve perfect (100%) coverage of variation in the Greek population. For three of these regions (2, 4, and 22), the CEU are actually more efficient as reference for Greeks, with fewer tSNPs needed; for the remaining regions, the same number of tSNPs is selected in both populations.
The TSI outperform the CEU as reference for the Greek population in 10 regions. However, there are four regions where the CEU are actually better reference samples than the TSI, contrary to what one might expect based on geographic proximity of the populations. Among them, the most notable are a region of Chromosome 7 (100% coverage using the CEU as reference vs. 91.7% using the TSI as reference) and the chromosomal region around COMT (100% coverage using the CEU as reference vs. 93% coverage using the TSI as reference). The Chromosome 7 region spans the TAS2R38 gene (responsible for the PTC taster/nontaster phenotype), as well as the CLEC5A gene. The latter gene has been found to have a role in immune response and to interact with dengue virus. Finally, the SLC44A5 regions (discussed in previous sections as one of the most population-differentiating regions in our study) were also captured more accurately in Greeks when the CEU were used as the reference population as opposed to the TSI.
Of the 27 studied regions, five regions resulted in coverage less than 90% when the TSI were used as reference. Of these five regions, one is captured at a percentage of less than 80% (79.3% coverage). This region is the LCT region, which is well known to be correlated with population differentiation between Southern and Northern European populations. It is not surprising that this same region results in the lowest coverage when the CEU population is used as reference for the Greek population (a coverage of only 65% is achieved). Using the CEU population as reference for the Greek population results in two more regions with coverage below 80%; interestingly, these two regions were the most population-differentiating regions in our sample, according to the PCA-scores-based analysis that we described earlier. More specifically, these regions are: (i) a Chromosome 12 region spanning a large number of genes including CD4 (66.6% coverage), and (ii) the Chromosome 17 CA10 region (73.3% coverage). A total of seven of the 27 studied regions cannot be covered at a percentage higher than 90% when the CEU population is used as the reference population (Fig. 3).
Overall, for regions that show less than 95% coverage in Greeks with TSI or CEU tSNPs, LD patterns are typically more complicated in the Greek population. Indeed, more tSNPs would have been selected if the Greek population was used as reference for itself (64.7% of the available SNPs were selected, on average, as tSNPs in such regions in the TSI, whereas 71.4% of the available SNPs were selected, on average, as tSNPs in the Greeks for the same regions; the respective numbers for such regions in the CEU and Greeks comparison were 64.4 and 69.4%).
Finally, we also performed the opposite experiment. We selected tSNPs in Greeks and attempted to see if they could serve as good reference for the TSI and CEU (Fig. 3B). Our premise was that if the Greek population has a more complex structure than the TSI and the CEU, it could serve as a better reference population for European genetic variation. Indeed, the Greek population appears to be a better reference for the CEU population than the CEU is for the Greek population, at least for the regions studied here. On average, the Greek tSNPs covered 96.2% of the CEU variation across the studied regions versus the 92.4% coverage achieved by the reverse comparison. On the other hand, the comparison between the Greek and TSI population yields equivocal results (94.6% coverage of TSI variation with Greek tSNPs vs. 94.7% for Greek variation coverage with TSI tSNPs). Nevertheless, upon closer examination of results, we observe that more regions in the TSI are captured with greater than 95% coverage with the Greek tSNPs than the other way around. With the Greek tSNPs, 13 regions in the TSI data are covered at 100%, whereas 18 regions are covered at above 95%. For the reverse experiment (TSI tSNPs applied to the Greeks), the numbers were 9 and 16 regions, respectively.
It has long been understood that allele and haplotype frequencies, as well as LD patterns and haplotype block structure, differ across worldwide populations (Sawyer et al., 2005). Within Europe, genetic variation has been shown to be distributed across two major axes (Novembre et al., 2008). However, we are only now beginning to appreciate the extent to which genomic structure differs among European populations and how this structure could possibly affect the design and interpretation of GWAS.
The HapMap project has created an extremely valuable resource for the worldwide scientific community, providing the most comprehensive catalog of genetic variation to date across multiple populations (International HapMap Consortium, 2007). It is generally accepted that the HapMap data set is hampered by a selection bias in the studied populations and by its focus on common SNPs, which overlooks rare population variants. However, a number of studies have supported the intuitive notion that, when designing an association study in an unstudied population (i.e. one not included in the HapMap project), and if interested in common variation, the optimal solution is to select the geographically closest HapMap population as reference, suggesting this would yield satisfactory coverage of the unknown population (Gu et al., 2007, 2008). At the same time, a couple of studies have attempted to alert researchers to the fact that genomic evolution is not homogeneous: different stochastic and biological factors may influence regions of the genome in a different way, thus producing unexpected patterns (Mueller et al., 2005; Pardo et al., 2009).
Here, we are presenting for the first time a large-scale study of genomic structure of the Greek population, in comparison to the HapMap reference populations of Northern (CEU) and Southern Europe (TSI). We studied 27 regions across the genome and 1112 SNPs, attempting to get a glimpse of the genomic structure of the Greek population and determine which of the HapMap reference populations could best represent variation in Greeks. So far, studies of genetic structure of the populations of Southern and Eastern Europe have been scarce and largely limited to Y chromosome and mitochondrial variation (Semino et al., 2000; Parreira et al., 2002; Richards et al., 2002; Di Giacomo et al., 2003; Malyarchuk et al., 2003; Robino et al., 2004). Little is known about the genomic architecture of these populations, despite their value in understanding the genomic structure of the European gene pool. The Balkan peninsula and Greece have been the gateway of human migrations to Europe during the Paleolithic and Neolithic ages, whereas the Bronze and Iron Ages were marked by the influence of the Greek culture and trading (Bosch et al., 2006). It is therefore clear that understanding the genetic structure of these populations could provide important insights into the genetic structure of the whole of Europe.
In general, after our detailed comparison of the Greek population with the HapMap Europeans, the patterns of our results are not as straightforward as one might have predicted based on intuition alone. It is clear that when thinking in terms of “averages,” both the TSI and the CEU can be considered good reference populations for the Greek population. However, we should emphasize that there exist genomic regions that will be captured poorly by either the TSI or the CEU. Furthermore, again based on “averages” and geographic proximity, one might be quick to conclude that the TSI are a better reference for the genomic variation of the Greek population compared to the CEU. However, our results reveal a rather surprising finding: despite an overall greater degree of similarity among Southern European populations, there exist regions of the genome where the CEU, a population of Northern European descent, is actually more representative of the Greek population than the Southern European Italian Tuscan population (TSI). Our results underline the fact that the evolution of the human genome is extremely complex and generalizations should be avoided.
Great care should be taken when interpreting GWAS results from a population that was not among the HapMap populations and, even more, when attempting to transfer GWAS findings from one population to another. Our findings indicate that the existence of local and population-specific LD variation in the human genome could significantly impair the design of association studies in the Greek population, if, for example, the variant in question lies in a region that is poorly covered by the HapMap reference samples. This could also be the case for other Balkan populations, and indeed for any population that is not included in the HapMap project. As the quest for the missing heritability component of common disorders becomes more pressing, it becomes apparent that such differentiating regions could have a negative impact on the accuracy of existing association studies. Moreover, their presence suggests the hypothesis that, by using the HapMap populations as reference, researchers may fail to appreciate population-specific variation, a deficit which can only be addressed by large-scale studies of a large number of carefully defined populations.
This work was supported, in part, by two Tourette Syndrome Association (TSA) Research Grant Awards to PP; a National Science Foundation (NSF) CAREER award to PD; and a European Molecular Biology Organization Short-Term Fellowship to PD. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.