Variation at Diabetes- and Obesity-Associated Loci May Mirror Neutral Patterns of Human Population Diversity and Diabetes Prevalence in India


Corresponding authors: Srilakshmi M Raj, 101 Biotechnology Building, Cornell University, Ithaca, NY 14853. Tel: +1 607 255 2556; Fax: +1 607 255 6249; E-mail: Toomas Kivisild, Leverhulme Centre for Human Evolutionary Studies, University of Cambridge, The Henry Wellcome Building, Fitzwilliam Street, Cambridge CB2 1QH, UK. Tel: +44 (0)1223 764703; Fax: +44 (0) 1223 764710; E-mail: Kumarasamy Thangaraj, CSIR-Centre for Cellular and Molecular Biology, Hyderabad 500 007, India. Tel: +91 40 27192828; Fax: +91 40 27160591; E-mail:


South Asian populations harbor a high degree of genetic diversity, due in part to demographic history. Two studies on genome-wide variation in Indian populations have shown that most Indian populations show varying degrees of admixture between ancestral north Indian and ancestral south Indian components. As a result of this structure, genetic variation in India appears to follow a geographic cline. Similarly, Indian populations seem to show detectable differences in diabetes and obesity prevalence between different geographic regions of the country. We tested the hypothesis that genetic variation at diabetes- and obesity-associated loci may be potentially related to different genetic ancestries. We genotyped 2977 individuals from 61 populations across India for 18 SNPs in genes implicated in T2D and obesity. We examined patterns of variation in allele frequency across different geographical gradients and considered state of origin and language affiliation. Our results show that most of the 18 SNPs show no significant correlation with latitude, the geographic cline reported in previous studies, or by language family. Exceptions include KCNQ1 with latitude and THADA and JAK1 with language, which suggests that genetic variation at previously ascertained diabetes-associated loci may only partly mirror geographic patterns of genome-wide diversity in Indian populations.


Disentangling the contribution of environment and genetics to complex disease risk requires large amounts of genetic data on large numbers of individuals, the usage of appropriate statistical models, and information on environment and phenotype. Aspects of these issues have proven to be a challenge especially for non-European populations (Need & Goldstein, 2009; Bustamante et al., 2011). Yet, often, these populations exhibit different etiologies and greater risk of certain complex diseases (Kumar et al., 2010; Gravel et al., 2011). Indians in general do not tend to develop high BMI compared to other global populations, yet have a high risk of type 2 diabetes (T2D) and have among the highest number of cases in the world, totaling over 51 million (McKeigue, 1989; McKeigue et al., 1991; International Diabetes Federation, 2009; Diamond, 2011; Finucane et al., 2011). This trend may be due in part to higher visceral fat deposition in Indians, suggesting an underlying biological basis for high T2D risk in Indians (McKeigue et al., 1991).

Why Indian populations exhibit such high risk of T2D remains an open question, however. Some studies suggest that Indians have a “thrifty phenotype,” which indicates that risk is predominantly due to environmental factors such as low birth weight and maternal nutrition status (Hales & Barker, 1992; Yajnik, 2000, 2004). Others have suggested a “thrifty genotype” in which evolutionary adaptations to harsh environmental conditions molded a genetic predisposition to energy thrift, which has become maladaptive in the presence of caloric abundance (Neel, 1962, 1999). Studies that have tested the thrifty genotype hypothesis have thus far not yielded candidate genes that appear to be thrifty in the context of T2D, with the possible exception of PPARGC1A, a gene that is associated with BMI in Tongans and may be under positive selection in that population (Paradies et al., 2007; Southam et al., 2009; Myles et al., 2011).

A critical step toward understanding the genetic basis of disease etiology is the understanding of local versus global patterns of genetic diversity. Yet, only a handful of attempts to study Indian genetic variation on a genome-wide scale have been published so far (Indian Genome Variation Consortium, 2008; Reich et al., 2009; Metspalu et al., 2011). One of the first genome-wide studies on Indian populations included 132 individuals from 25 populations across India (Reich et al., 2009). The study demonstrated that most Indian populations are derived from a mixture of two major groups, ancestral north Indians (ANI) and ancestral south Indians (ASI), with proportions of the ANI component varying from 39% to 71%. The pattern of two major ancestry components has been confirmed in a separate study including 142 samples from 30 Indian populations (Metspalu et al., 2011). Long-term genetic isolation among populations, possibly amplified by the social structuring of the caste system, may have heightened the effects of genetic drift, contributing to the high degree of population structure observed. This substructure implies that Indians may have an excess of certain recessive genetic disorders compared with other populations (Reich et al., 2009).

In addition to possible consequences for disease predisposition, genetic diversity across India may follow a geographic cline. Thus far, evidence of a latitudinal cline in India has been mixed. One study on candidate disease-associated SNPs showed that genetic variation does not appear to vary along a latitudinal cline within India (Pemberton et al., 2008). A genome-wide study of genetic variation in India, however, reported a geographic (northwest to southeast) gradient of relatedness extending from Europe to India, which they call “the India cline,” perhaps reflecting a gradient in ANI-ASI admixture proportions (Reich et al., 2009). Supporting the hypothesis of a genetic basis for T2D susceptibility in India is the appearance of a north-south gradient in diabetes prevalence, mirroring the genetic variation-based India cline. Cities in the state of Kerala in south India have up to threefold higher T2D prevalence than the northern-most state, Kashmir (Ramachandran et al., 2001; Deepa et al., 2003; Mohan et al., 2006; Fig. S1). The distribution of BMI in India shows a different trend, with higher BMI values among both men and women in north and south Indian states, but lower BMI values in central regions of India (Fig. S1).

These two clines, reflected in genetic and T2D prevalence data, may indicate a relationship between diabetes susceptibility and genetic variation in India. We studied the distribution, in Indian populations sampled at a fine geographical scale, of genetic variants associated with susceptibility to T2D and obesity in Europeans. The aim was to assess whether obesity and T2D risk alleles follow geographic patterns within India consistent with known distributions of disease prevalence and genetic ancestry. Compared to many previous studies, which have focused on specific populations, this study uses over 3200 individuals from 61 different populations sampled across India and its north-south cline. Our sample includes populations from diverse ethnic, linguistic, geographic, and cultural backgrounds.

Materials and Methods

Populations Selected for Genotyping within Karnataka and other States of India

At the national level, 1530 individuals belonging to 38 populations outside Karnataka state were genotyped for SNPs associated with T2D and obesity (Table 1; Table S1). Besides the cross-national level, we focused on genetic variation within the single state of Karnataka in India to minimize cultural, geographic, and linguistic differences among populations. We collected over 1500 saliva samples of reportedly unrelated individuals (separated by at least two generations) belonging to 14 populations across Karnataka; 1447 of these individuals were included in the final analysis. Populations were selected to represent a diverse cross-section of variation in Karnataka and included all five major caste groups and two major tribal groups. All samples were collected with the informed written consent of the donors and the study was approved by the Institutional Ethical Committee of the CCMB.

Table 1. Description of the data.

Table 1a. The number of individuals and populations included in the paper. Further information is available in Tables S1, S3, and S4.

Population category | Number of individuals | Number of populations
Populations genotyped for 18 SNPs
Within Karnataka | 1447 | 23
Outside Karnataka | 1530 | 38
State | 2977 | 16 (states)
Language family | 2977 | 4 (language families)
Publicly available datasets (total 506,306 SNPs)
World, interpolated maps | 1898 | 94
Reich cline, outside India | 829 | 5 (geog. regions)
Within India | 311 | 9

Table 1b. List of SNPs genotyped in Indian populations

SNP | Gene | Chromosome | Trait | Discovery | Risk allele | | Reference
rs10146997 (A>G) | NRXN3 | 14 | Obesity | GWAS (waist circumference) | G | G | (Heard-Costa et al., 2009)
rs10229583 (G>A) | PAX4 | 7 | T2D | 1% iHS South Indians | A | A | (Gaulton et al., 2008)
rs10811661 (T>C) | CDKN2A/B | 9 | T2D | GWA | T | T | (Zeggini et al., 2008)
rs11208534 (A>G) | JAK1 | 1 | T2D | 5% iHS South Indians | G | G | (Gaulton et al., 2008)
rs12330015 (A>G) | PPARA | 22 | T2D | 1% iHS South Indians | G | A | (Gaulton et al., 2008)
rs12970134 (G>A) | MC4R | 18 | Obesity | GWA (waist circumference in Indians) | A | G | (Chambers et al., 2008)
rs13220810 (T>C) | FOXO3A | 6 | T2D | Highly conserved; role in ageing | C | C | (Willcox et al., 2008)
rs1349498 (G>A) | RAPGEF4 | 2 | T2D | 1% iHS South Indians | A | G | (Gaulton et al., 2008)
rs1713222 (C>T) | APOB | 2 | T2D | 1% iHS South Indians | T | T | (Gaulton et al., 2008)
rs17647588 (C>T) | NFE2L2 | 2 | T2D | 1% iHS South Indians | T | T | (Gaulton et al., 2008)
rs17782313 (T>C) | MC4R | 18 | Obesity | GWA | C | C | (Loos et al., 2008)
rs2237892 (C>T) | KCNQ1 | 11 | T2D | GWA in Asians | C | T | (Unoki et al., 2008; Yasuda et al., 2008)
rs6802898 (C>T) | PPARG | 3 | T2D | 1% iHS South Indians; GWA | C | C | (Altshuler et al., 2000; Gaulton et al., 2008)
rs7578597 (T>C) | THADA | 2 | T2D | GWA | C | C | (Zeggini et al., 2008)
rs7903146 (C>T) | TCF7L2 | 10 | T2D | GWA | T | C | (Saxena et al., 2006)
rs985694 (C>T) | ESR1 | 6 | T2D | Gaulton (2008) candidate | T | C | (Gaulton et al., 2008)
rs9911630 (G>A) | BRCA1 | 17 | Breast cancer | Potential candidate | A | A | (Miki et al., 1994; Larsson et al., 2007)
rs9939609 (T>A) | FTO | 16 | Obesity | GWA | A | T | (Frayling et al., 2007)

  1. The “Discovery” column refers to the reasons that the SNP was chosen for genotyping. In the risk allele column, the actual risk alleles are indicated in bold while the rest are minor alleles, unless otherwise indicated.

Published Genome-Wide Data Sources

Indian samples were grouped by geographic region, language family, or caste/tribe status (Table S4). Because Uttar Pradesh Brahmins and Gujaratis had larger sample sizes compared with other south Asian populations, they were not grouped into these broader categories but were analyzed separately. All populations had a minimum sample size of seven individuals. We estimated genome-wide average FST among populations from a combined dataset including 506,306 SNPs. PLINK software was used to assemble all genome-wide marker data (Purcell et al., 2007).

Data from 1898 individuals belonging to 94 distinct global populations drawn from six published sources were included in this study (Li et al., 2008; Behar et al., 2010; Rasmussen et al., 2010; The International HapMap Consortium, 2010; Gallego Romero et al., 2011; Metspalu et al., 2011; Table S3).

SNP Selection for Genotyping

Samples from Karnataka (1447 samples) and from the CCMB collection (1530 samples) were genotyped for 18 SNPs in gene regions with potential roles in T2D and obesity etiology in Indian populations. Seven of these variants have been confirmed to be associated with either T2D or obesity in GWA studies (Table 1). Because the association of these SNPs with T2D or obesity was determined in largely European GWA studies, and other SNPs may also serve as good candidates for these diseases in Indian populations, we used other methods to select SNPs that may be candidates for T2D and obesity risk in Indians. An additional 6 out of the 18 variants came from a list of 222 candidate genes involved in T2D (Gaulton et al., 2008), selected on the basis of evidence of scans of extended haplotype homozygosity applied on south Asian populations, using the linkage disequilibrium-based iHS statistic (Voight et al., 2006; Metspalu et al., 2011). We isolated genes belonging to the top 1% to top 5% of iHS scores for inclusion in the study. Other SNPs were chosen based on other biological indicators of their candidacy (Table 1). Of the 18 variants, the seven variants rs10811661, rs12970134, rs17782313, rs2237892, rs7578597, rs7903146, rs9939609 were also found to be associated with T2D, obesity, or related traits in Asian Indians (Bodhini et al., 2007; Chambers et al., 2008; Rees et al., 2008; Yajnik et al., 2009; Been et al., 2011; Rees et al., 2011; Taylor et al., 2011; Dwivedi et al., 2012; Gupta et al., 2012; Li et al., 2012; Vasan et al., 2012; Dwivedi et al., 2013).

The final criterion for SNP selection was compatibility in the multiplex design. Taken together, the full list of T2D- and obesity-associated SNPs, candidate SNPs in the Gaulton et al. (2008) list, as well as other candidate loci provide several hundred testable SNPs. The Sequenom genotyping platform (Sequenom GmbH, Hamburg, Germany) uses a multiplex-based system to type multiple SNPs at the same time. An algorithm provided by Sequenom was used to create optimal combinations of SNPs to minimize the chance of SNP genotyping failure.

Upon testing for Hardy-Weinberg equilibrium and applying a Bonferroni correction to all SNPs in all samples, only two populations showed significant deviation from HWE (Table S1). We decided to retain the populations and SNPs showing deviation from HWE in the analysis, because of: (1) confidence in the genotype scoring method, (2) large sample size, and (3) a potential role of the SNPs in T2D and obesity etiology in Indians. Genotyping these SNPs, and resequencing these regions in additional, independent cohorts of Reddy and Ao Naga populations is required to confirm the significance of the deviations from HWE.
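As an illustration of the kind of test described above, the following minimal sketch applies a chi-square goodness-of-fit test for Hardy-Weinberg equilibrium at one biallelic SNP and compares the p-value against a Bonferroni-adjusted threshold. It is not the procedure implemented in GDA, which may differ (for example by using an exact test). The function names, the genotype counts, and the number of tests (18 SNPs × 61 populations) are assumptions made for the example.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test (1 d.f.) for HWE at a biallelic SNP.

    Returns (chi2, p_value). Genotype classes with zero expected count
    are skipped, so a monomorphic sample yields chi2 = 0.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    q = 1.0 - p                          # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

# Hypothetical genotype counts (not data from this study).
chi2, p_value = hwe_chi_square(40, 20, 40)
threshold = bonferroni_threshold(0.05, 18 * 61)   # assumed number of tests
deviates = p_value < threshold
```

A population-SNP pair would be flagged as deviating from HWE only when its p-value falls below the adjusted threshold, which is why few deviations are expected to survive correction.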

DNA Isolation from Saliva

DNA isolation from saliva samples was carried out using two different protocols. The first involved DNA extraction kits (Oragene, DNA Genotek Inc., Kanata, Canada) that were used to collect saliva and extract DNA from approximately 400 of the 1500 participants from Karnataka state, India. Saliva collection and DNA extraction were carried out according to manufacturer's protocols. DNA pellets were dissolved in 100 μl of autoclaved double-distilled water, or autoclaved milliQ water.

The majority of saliva samples collected in Karnataka were processed using a noncommercial DNA extraction protocol (Quinque et al., 2006), which was further modified to accommodate variations in saliva-buffer solution amounts across subjects. For each milliliter of saliva-lysis buffer solution, 15 μl proteinase K, at a concentration of 30 mg/ml (Sigma-Aldrich, Bangalore, India), 75 μl 10% SDS, and 200 μl 5M NaCl were added into the conical tube containing the sample. The proteinase K was kept on ice, and the sample tubes and 10% SDS were kept at room temperature, prior to the above step. The tubes were then placed into a shaking water bath for 24 h at 53°C. Crude proteinase K was sometimes used. In this instance, proteinase K was incubated at room temperature for 10 min to allow it to remove its own proteases. The concentration was increased to 50 mg/ml. Samples were also incubated for 36 h instead of 24 h. DNA samples were subsequently stored in autoclaved, distilled water.

Genotyping Using Sequenom Platform

All samples were genotyped using the MassARRAY system (Sequenom GmbH) for 32 SNPs (32-plex system), although only 18 of these SNPs are included in the present study. All reactions were carried out according to manufacturer's protocols. Approximately 5 ng of DNA was used for each reaction, corresponding to 1 well on a 384-well plate. A linear regression to calculate the appropriate primer dilutions and amounts of primer to be pooled for the 32-plex reaction was used for greater accuracy. Genotypes were inferred using the software provided with the instrument. We obtained >95% SNP calls for all 32 SNPs, with the exception of one of the 384-well plates, for which 14% of SNP calls were not available. Ambiguous genotypes were visually scored.

Geographic Analyses

ESRI ArcMap software (v. 9.2) was used to visualize spatial patterns of allele frequencies on geographic maps. Shapefiles of the world and India were obtained online. Interpolated patterns of allelic variation on a global level were generated using the inverse-distance weighted method as implemented in the Spatial Analyst Tools function within ArcMap.

These interpolations were made based on the 12 nearest points to the region of interpolation, restricted to land-only boundaries. To extend the interpolation for full global coverage, four dummy points were created to represent the extreme points of the map, with dummy frequencies, which fell in the range of the frequency values. Any observed latitudinal or longitudinal patterns were confirmed by Spearman rank correlation, with Bonferroni-corrected p-values to calculate statistical significance.

A Mantel test was used to test for the relationship between genetic and geographic distance. For each pair of populations, we calculated geographic distance in kilometers based on great circle distances measured using the haversine formula (Sinnott, 1984).
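The great circle distance between two sampling locations can be sketched with the haversine formula as follows. The mean Earth radius of 6371 km and the function name are assumptions for illustration; the two coordinates are those given in this paper for Karachi and Kolkata.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Guard against rounding pushing a marginally above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

# Karachi (25.0, 69.0) to Kolkata (22.6, 88.4)
d = haversine_km(25.0, 69.0, 22.6, 88.4)
```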

Assuming that until very recently populations followed a land-only migration route from Africa and avoided crossing large bodies of water, obligatory waypoints were added to calculate pairwise population distances across continents. We incorporated the five waypoints used by Ramachandran et al. (2005). As we also included several Indian populations, we added two additional waypoints: Karachi, Pakistan (25.0, 69.0) and Kolkata, India (22.6, 88.4), through which all populations entering the Indian subcontinent from the west and the east, respectively, were forced to travel.
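A minimal sketch of waypoint-routed distance is given below: the path length is the sum of great circle legs through an ordered list of obligatory waypoints. Which waypoints apply to a given pair of populations is left to the caller here, and the origin and destination coordinates are hypothetical; only the Karachi waypoint is taken from the text.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2)
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def routed_distance_km(origin, destination, waypoints=()):
    """Sum of great circle legs from origin to destination via ordered waypoints."""
    path = [origin, *waypoints, destination]
    return sum(haversine_km(*a, *b) for a, b in zip(path, path[1:]))

KARACHI = (25.0, 69.0)            # waypoint coordinates as given in the text
origin = (35.7, 51.4)             # hypothetical population west of India
destination = (12.97, 77.59)      # hypothetical population in south India

direct = routed_distance_km(origin, destination)
via_karachi = routed_distance_km(origin, destination, [KARACHI])
```

Routing through a waypoint can only lengthen the path relative to the direct great circle distance, so the routed value is an upper bound that reflects the assumed land-only route.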

Descriptive Statistics

Estimates of descriptive statistics such as observed and expected levels of heterozygosity, and tests for Hardy-Weinberg equilibrium were calculated using GDA software (Lewis & Zaykin, 2001). Allele frequencies were calculated using Arlequin v. software (Excoffier et al., 2005).

GDA implements the unbiased estimator of observed heterozygosity proposed by Nei (1987). According to this formula, observed heterozygosity is $H_O = 1 - \sum_{i=1}^{k} X_{ii}$, in which $X_{ii}$ is the relative frequency of each of the k possible homozygous genotypes at a given locus.

Expected heterozygosity is implemented according to Nei (1987), in which expected heterozygosity ($H_S$) in the sample, calculated as $1 - \sum_{i} p_i^2$, is multiplied by the factor $2n/(2n-1)$, where n is the number of individuals sampled, to account for variation in population sizes. Here, $p_i$ is the frequency of the i-th allele observed at a given SNP.
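The two heterozygosity estimates can be sketched as below, assuming the Nei (1987) forms: observed heterozygosity as one minus the summed homozygote frequencies, and expected heterozygosity as one minus the summed squared allele frequencies, scaled by 2n/(2n − 1) for n sampled individuals. This is an illustrative sketch rather than the GDA implementation; the function names, the input layout, and the genotype counts are assumptions.

```python
from collections import Counter

def observed_heterozygosity(genotype_counts):
    """H_O = 1 - sum of homozygous genotype frequencies.

    genotype_counts maps (allele1, allele2) tuples to numbers of individuals.
    """
    n = sum(genotype_counts.values())
    homozygotes = sum(c for (a, b), c in genotype_counts.items() if a == b)
    return 1.0 - homozygotes / n

def expected_heterozygosity(genotype_counts):
    """Small-sample corrected expected heterozygosity: (2n/(2n-1)) * (1 - sum p_i^2)."""
    n = sum(genotype_counts.values())
    allele_counts = Counter()
    for (a, b), c in genotype_counts.items():
        allele_counts[a] += c
        allele_counts[b] += c
    sum_p2 = sum((count / (2 * n)) ** 2 for count in allele_counts.values())
    return (2 * n / (2 * n - 1)) * (1.0 - sum_p2)

# Hypothetical genotype counts at one SNP (not data from this study).
counts = {("C", "C"): 25, ("C", "T"): 50, ("T", "T"): 25}
h_obs = observed_heterozygosity(counts)
h_exp = expected_heterozygosity(counts)
```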

Genetic distances, or degree of population differentiation, were measured using FST, for all pairs of populations for each SNP, using FSTAT software v. 2.9.3 (Weir & Cockerham, 1984; Goudet, 2001). The unbiased estimate of FST can sometimes have negative values, which do not have biological significance, or may result in error values if minor allele count is zero in a pair of populations. The negative and error FST values were thus set to zero.

FST values were calculated as $F_{ST} = a/(a + b + c)$, where a, b, and c are determined by equations 2, 3, and 4 in Weir & Cockerham (1984). FST values estimated across multiple markers, or on the genome-wide set of markers, were calculated using the mean of each of a, b, and c.
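The combination step can be sketched as follows, assuming the per-SNP variance components a, b, and c of Weir & Cockerham (1984) have already been computed (here they are hypothetical numbers, and the function names are illustrative). Negative or undefined estimates are set to zero, as described above.

```python
def fst_from_components(a, b, c):
    """F_ST = a / (a + b + c), with negative or undefined values set to zero."""
    total = a + b + c
    if total == 0:
        return 0.0            # undefined ratio, treated as zero
    return max(0.0, a / total)

def multilocus_fst(components):
    """F_ST across markers, using the mean of each of a, b, and c."""
    n = len(components)
    mean_a = sum(x[0] for x in components) / n
    mean_b = sum(x[1] for x in components) / n
    mean_c = sum(x[2] for x in components) / n
    return fst_from_components(mean_a, mean_b, mean_c)

# Hypothetical (a, b, c) components for two SNPs.
per_snp = [(0.01, 0.02, 0.07), (0.03, 0.02, 0.15)]
single = fst_from_components(*per_snp[0])
combined = multilocus_fst(per_snp)
```

Averaging the components before taking the ratio weights each marker by its total variance, which differs from averaging the per-SNP FST values themselves.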

Other Statistical Analyses

The program R (v. 2.11.1; R Development Core Team, 2006) was used to calculate Fisher's exact tests as well as Spearman rank correlations between pairs of variables. P-values for significance of the Spearman rank correlations were corrected for multiple testing using Holm's method, a sequentially rejective variant of the Bonferroni correction.
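For readers comparing the two corrections, the sketch below computes Bonferroni- and Holm-adjusted p-values in plain Python. Holm's step-down procedure controls the same family-wise error rate, and its adjusted p-values are never larger than the Bonferroni ones. The function names and the input p-values are hypothetical; the analyses in this study were run in R.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: each p multiplied by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        adj = min(1.0, (m - rank) * pvals[i])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, adj)
        adjusted[i] = running_max
    return adjusted

raw = [0.01, 0.04, 0.03]          # hypothetical unadjusted p-values
p_bonf = bonferroni(raw)
p_holm = holm(raw)
```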

Mantel tests for correlation between genetic and geographic distances, generated through FSTAT and the haversine, respectively, were calculated using GenAlEx v. 6.4 (Peakall & Smouse, 2006).
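The logic of a Mantel test can be sketched as a permutation test on two distance matrices: the correlation between matching off-diagonal entries is compared with its distribution when the population labels of one matrix are shuffled. This is a generic one-sided version written for illustration and is not the GenAlEx implementation; the number of permutations, the seed, the function names, and the toy matrices are assumptions.

```python
import math
import random

def _upper(mat):
    """Upper-triangle entries of a square matrix, row by row."""
    n = len(mat)
    return [mat[i][j] for i in range(n) for j in range(i + 1, n)]

def _pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mantel(genetic, geographic, permutations=999, seed=0):
    """Return (r, p): Mantel correlation and one-sided permutation p-value."""
    n = len(genetic)
    geo_vec = _upper(geographic)
    r_obs = _pearson(_upper(genetic), geo_vec)
    rng = random.Random(seed)
    idx = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(idx)
        # Permute rows and columns of the genetic matrix together.
        perm = [[genetic[idx[i]][idx[j]] for j in range(n)] for i in range(n)]
        # Small tolerance so ties are not lost to floating-point rounding.
        if _pearson(_upper(perm), geo_vec) >= r_obs - 1e-12:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)

# Toy example: six populations on a line, with F_ST proportional to distance.
geo = [[abs(i - j) for j in range(6)] for i in range(6)]
gen = [[0.01 * abs(i - j) for j in range(6)] for i in range(6)]
r, p = mantel(gen, geo)
```

Permuting whole rows and columns, rather than individual matrix entries, preserves the dependence among distances that share a population, which is the reason a standard correlation test is not used here.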


Hardy-Weinberg equilibrium tests revealed that the 18 genotyped SNPs (Table 1) were in HWE in most of the 61 populations, with only two exceptions after Bonferroni correction (Tables S1 and S2).

To place frequency differences among the 61 Indian populations into a global geographic context, we examined allele frequency differences at several geographic scales: global, across a continental India cline, within India, among state and language groups within India and within the single state of Karnataka.

Global Scale/Eurasia

On a global scale, using 94 populations across the world, several variants showed patterns that correlated strongly with longitude rather than latitude (Table 2; Fig. 1). For the alleles listed in Table 2, which show frequency differences along latitudinal or longitudinal gradients, the Mantel correlation between FST (Weir & Cockerham, 1984) and geographic distance, with obligatory waypoints included, is between 0.2 and 0.4 (p < 10^−3). Many of the loci follow clear longitudinal gradients across Eurasia (Fig. 1). Spearman rank correlations between longitude, latitude, and allele frequencies in the Eurasian subset of the 94 global populations generally followed the same trend as in the global populations, perhaps because most of the populations included in the global analysis came from Eurasia (Table 2). However, the correlation between FOXO3A rs13220810 C and latitude disappeared, and the correlation between MC4R rs12970134 A and longitude instead became significant (Fig. 1; Table 2).

Table 2. Spearman rank correlation between allele frequencies, latitude, and longitude, with Bonferroni-corrected p-values

Geographic level / gene | SNP | Allele | Latitude ρ | Latitude pcorr | Longitude ρ | Longitude pcorr | Mantel r2 | Mantel p
World
NRXN3 | rs10146997 (A>G) | G | −0.14 | 1 | −0.73 | <10^−4 | 0.287 | 0.0001
RAPGEF4 | rs1349498 (G>A) | A | 0.22 | 1 | 0.63 | <10^−4 | 0.152 | 0.0134
APOB | rs1713222 (C>T) | T | −0.08 | 1 | −0.54 | <10^−4 | 0.263 | 0.0003
NFE2L2 | rs17647588 (C>T) | T | 0.25 | 0.936 | −0.49 | <10^−4 | 0.399 | 0.0001
FOXO3A | rs13220810 (T>C) | C | 0.38 | 0.015 | −0.09 | 1 | 0.233 | 0.0007
ESR1 | rs985694 (C>T) | T | −0.08 | 1 | 0.74 | <10^−4 | 0.192 | 0.0012
BRCA1 | rs9911630 (G>A) | A | 0.63 | <10^−4 | −0.04 | 1 | 0.309 | 0.0024
THADA | rs7578597 (T>C) | C | −0.1 | 1 | −0.35 | 0.041 | 0.227 | 0.0263
TCF7L2 | rs7903146 (C>T) | T | −0.12 | 1 | −0.49 | <10^−4 | 0.386 | 0.0001
KCNQ1 | rs2237892 (C>T) | C | −0.01 | 1 | −0.33 | 0.092 | 0.425 | <10^−4
Eurasia
NRXN3 | rs10146997 (A>G) | G | 0.03 | 1 | −0.75 | <10^−4 | |
RAPGEF4 | rs1349498 (G>A) | A | 0.27 | 1 | 0.72 | <10^−4 | |
APOB | rs1713222 (C>T) | T | −0.13 | 1 | −0.66 | <10^−4 | |
NFE2L2 | rs17647588 (C>T) | T | 0.15 | 1 | −0.73 | <10^−4 | |
ESR1 | rs985694 (C>T) | T | −0.2 | 1 | 0.78 | <10^−4 | |
BRCA1 | rs9911630 (G>A) | A | 0.72 | <10^−4 | 0.01 | 1 | |
THADA | rs7578597 (T>C) | C | −0.11 | 1 | −0.45 | 0.003 | |
TCF7L2 | rs7903146 (C>T) | T | −0.23 | 1 | −0.64 | <10^−4 | |
KCNQ1 | rs2237892 (C>T) | C | −0.26 | 1 | −0.58 | <10^−4 | |
MC4R | rs12970134 (G>A) | A | −0.26 | 1 | −0.4 | 0.021 | |
India
KCNQ1 | rs2237892 (C>T) | C | −0.51 | 0.005 | −0.26 | 1 | −0.032 | 0.395
India language groups
KCNQ1 | rs2237892 (C>T) | C | −1 | <10^−4 | −0.5 | <10^−4 | |
THADA | rs7578597 (T>C) | C | −0.5 | <10^−4 | −1 | <10^−4 | |
NRXN3 | rs10146997 (A>G) | G | −0.5 | <10^−4 | −1 | <10^−4 | |
JAK1 | rs11208534 (A>G) | G | 0.5 | <10^−4 | −1 | <10^−4 | |

  1. Some of these values are also displayed in Figure 1. The numbers in bold reflect statistically significant correlations. Mantel correlations between FST and geographic distance are also given for all the world populations, as well as the India Sequenom groups. For most populations, Mantel correlations between FST and geographic distance range between 0.2 and 0.4. These correlations are lower than the Mantel correlation of 0.8851 reported by Ramachandran et al. (2005) in an analysis of 783 microsatellites in 53 populations, 49 of which form a subset of the 94 global populations in this study (Ramachandran et al., 2005; Table S2).
Figure 1.

Distribution of allele frequencies across global populations. The line above each map specifies the allele and the line below the map gives the Spearman rank correlation coefficient and Bonferroni-corrected p-value for significance of correlation between latitude (“lat”) and longitude (“lon”) and allele frequency. The reference alleles for T2D-associated TCF7L2, THADA, KCNQ1, and obesity-associated NRXN3 and THADA are risk alleles for the diseases. Global patterns and correlations for 11 out of the 18 SNPs are shown here because they were statistically significant. The two exceptions are MC4R rs12970134 and KCNQ1 rs2237892, which may be significant within India (Fig. 2).

Patterns of Variation along the Indian Cline

The 62 Indian populations genotyped on the Sequenom platform show comprehensive geographic distribution across India, particularly on the north-south axis, allowing us to test for evidence of T2D- and obesity-associated allele frequency patterns following the northwest-southeast Indian cline revealed by genome-wide patterns (Reich et al., 2009) and the north-south cline in T2D prevalence (Fig. S1). Spearman rank correlations estimated on a set of seven Indian populations (excluding Austro-Asiatic and Sino-Tibetan Indians) and five non-Indian population groups (“Caucasus,” “Central Asia,” “Europe,” “Near East,” and “Pakistan”; Table S3) showed significant correlations between longitude and the frequencies of SNPs NRXN3 rs10146997 (ρ = −0.83, pcorr = 0.0142), NFE2L2 rs17647588 (ρ = −0.82, pcorr = 0.0181), and ESR1 rs985694 (ρ = 0.85, pcorr = 0.0062). None of these SNPs showed a significant correlation with latitude. In accordance with known geographic patterns in skin pigmentation, however, both ESR1 and BRCA1 showed geographic patterning along the Indian cline (Table 3), and BRCA1 additionally correlated with latitude in global populations (Table 2; Jablonski & Chaplin, 2000). Mantel tests revealed a strong correlation between FST and geographic distance for NFE2L2 and ESR1, corresponding also to a strong correlation between minor allele frequency and longitude (Table 3). Interestingly, KCNQ1 shows a negative Mantel correlation across the Indian cline, suggesting less genetic differentiation with increased geographic distance. This result stands in contrast to the Mantel correlation estimated in the global analysis, in which populations were not grouped by geographic region and which involved a larger geographic range of populations (Table 3). The difference could be due to the nature of the population groupings in the Indian-cline analysis: populations in larger geographic regions (World) were grouped together, while populations at smaller geographic scales were left intact (India; Table 3). The grouping scheme used here suggests that correlations with geography may be significant at a macrolevel scale of population sampling, but may not be strong enough to reach significance at a microlevel scale.

Table 3. Spearman rank and Mantel correlations among populations included in the “Indian cline” analysis

Gene | SNP | Allele | Latitude ρ | Latitude pcorr | Longitude ρ | Longitude pcorr | Mantel r2 | Mantel p
NRXN3 | rs10146997 (A>G) | G | 0.36 | 1 | −0.83 | 0.014 | 0.35 | 0.026
MC4R | rs12970134 (G>A) | A | −0.09 | 1 | −0.24 | 1 | −0.11 | 0.255
KCNQ1 | rs2237892 (C>T) | C | −0.31 | 1 | −0.23 | 1 | −0.22 | 0.028
THADA | rs7578597 (T>C) | C | −0.03 | 1 | − | | |
TCF7L2 | rs7903146 (C>T) | T | −0.16 | 1 | −0.51 | 1 | −0.12 | 0.299
PPARG | rs6802898 (C>T) | C | −0.24 | 1 | 0.29 | 1 | −0.04 | 0.476
PAX4 | rs10229583 (G>A) | A | −0.53 | 1 | −0.07 | 1 | 0.22 | 0.116
PPARA | rs12330015 (A>G) | G | 0.68 | 0.605 | −0.03 | 1 | 0.12 | 0.198
RAPGEF4 | rs1349498 (G>A) | A | 0.11 | 1 | 0.43 | 1 | 0.05 | 0.318
APOB | rs1713222 (C>T) | T | 0.36 | 1 | −0.43 | 1 | −0.10 | 0.376
NFE2L2 | rs17647588 (C>T) | T | 0.54 | 1 | −0.82 | 0.018 | 0.86 | <10^−4
FOXO3A | rs13220810 (T>C) | C | 0.35 | 1 | −0.59 | 1 | −0.16 | 0.161
ESR1 | rs985694 (C>T) | T | −0.51 | 1 | 0.85 | 0.006 | 0.81 | <10^−4
BRCA1 | rs9911630 (G>A) | A | 0.77 | 0.095 | −0.29 | 1 | 0.40 | 0.012

  1. The “World” populations show the groupings used for outside India populations, and “India” are the populations from India included in the analysis. India cline populations include the “World” populations grouped into “Caucasus,” “Central Asia,” “Pakistan,” “Near East,” and “Europe.” The Indian populations are grouped into “UP Brahmins,” “Central India tribe,” “Gujaratis,” “North India caste,” “North India tribe,” “South India caste,” and “South India tribe.” Details on the populations included in these groupings are provided in Table S3.

Patterns of Variation within India

While many of the SNPs presently studied showed geographic patterns that mirrored latitudinal or longitudinal gradients on a global scale, these were absent or less pronounced in the Indian populations. Furthermore, Mantel test results show higher correlation of geographic and genetic distance at a global level than within India (Table 2). The pattern found here is consistent with frequencies of other disease-associated variants, which appear to vary along a latitudinal cline in world populations but not within India (Pemberton et al., 2008). Only a weak Mantel correlation between FST at KCNQ1 rs2237892 C and geographic distance was observed. This correlation may be due to the inclusion of the Sino-Tibetan language-speaking Nyshi and Ao Naga populations of Northeast India, which show dramatically lower risk allele frequencies compared to populations in the rest of India, wherein the allele is close to or at fixation (Fig. 2).

Figure 2.

Distribution of rs12970134 A, rs2237892 C, rs7578597 C, and rs11208534 G alleles within India. Spearman rank correlation coefficients between rs12970134 allele frequency, latitude, and longitude were not significant on a global scale or within India. Red dots on the world map represent populations. Populations within India, showing unusual allele frequency differences, labeled in red speak Austro-Asiatic languages, those in blue speak Indo-European languages, green speak Sino-Tibetan languages, black speak Dravidian languages, and brown are a linguistic isolate (Nihali). For KCNQ1, the colors are reversed in the within-India group for clarity. Spearman rank correlations are provided both across all Indian populations, as well as populations grouped by language family. For the JAK1 locus, only Spearman rank correlation within India is provided, because this locus was unavailable in the global dataset.

Thus far, few studies have examined genome-wide variation among Indian populations by their geography (Reich et al., 2009; Metspalu et al., 2011). The Reich et al. (2009) study estimated genome-wide FST to be 0.01 among Indian populations, excluding Sino-Tibetans and other outlying populations, about three times higher than among European populations. When they adjusted their estimate to account for the effects of inbreeding, which could inflate differences between populations, the FST value decreased to 0.0069. To provide a comparative estimate based on the Illumina samples used in this study, we also calculated FST between north and south Indian population groups at 9942 SNPs sampled randomly from the genome (Table S5).

Pairwise FST differences between north and south Indian population groups in our genome-wide dataset showed values resembling the Reich et al. (2009) inbreeding-adjusted estimate from the Affymetrix data, although the FST values calculated between north and south Indian population groups from the Illumina data were not adjusted for inbreeding.

Grouping by State of Origin

We evaluated allele frequency differences among populations grouped by Indian state of origin to test for fine-scale patterns of local differentiation at alleles associated with T2D and obesity. Spearman rank correlations with latitude and longitude and Mantel correlations with geographic distance did not reveal strong geographic patterns of allele frequencies across India among these groups. Patterns of allele frequency across Indian states may indicate a latitudinal cline in the obesity-associated MC4R SNP rs12970134, although the Spearman rank correlation with latitude was not significant (Table S7; Fig. 2). Comparisons of FST between Indian state groups and global populations show lower allele frequency differences among Indian states than among global regions across the 14 SNPs. However, FST differences corresponding to state groups are similar to the genome-wide FST value of 0.01 (Reich et al., 2009), suggesting that the studied obesity and T2D risk alleles as a group do not show reduced diversity as expected from their disease association (Fig. 3).

Figure 3.

To compare the influence of both population groupings and population sampling at the level of India and also of Karnataka, we calculated FST values for: (A) all 62 populations, ungrouped; (B) all Indian populations minus the Sino-Tibetan populations; (C) all non-Karnataka Indian populations, only one Karnataka population (Gangadikaara Vokkaliga) and no Sino-Tibetan populations; and (D) Karnataka populations only. We also included published values of FST among Indians (Reich et al., 2009). Figure 3(A) shows FST differences among 15 of the 18 SNPs studied, and Figure 3(B) shows FST differences among the remaining three SNPs; these were separated because they show extreme FST differences between the different population grouping schemes. FST values are provided in Table S9.

Language Family Grouping

Language families also show geographic clustering in India (Reich et al., 2009; Gallego Romero et al., 2011). Therefore, we carried out Spearman rank correlation tests between allele frequency, latitude, and longitude among populations grouped by language family within India. Strong correlations were observed for four alleles (Table 2; Fig. 2). Of these four alleles, KCNQ1 rs2237892 and THADA rs7578597 were associated with T2D, NRXN3 rs10146997 with waist circumference, and JAK1 rs11208534 was found to be undergoing natural selection within India, based on the iHS statistic. When populations were regrouped according to state of residence, none of the allele frequencies correlated with latitude or longitude. This finding is consistent with previous studies based on regional languages, which approximately follow state boundaries in India (Pemberton et al., 2008). Alternatively, not all states were comprehensively sampled across caste and tribe boundaries; while south Indian states were heavily represented by several castes and tribes, many north and northeast Indian states were represented by only two populations, the Ao Naga and Nyshi, which may have reduced our power to detect geographic patterns among Indian states.
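The Spearman rank correlation used above can be sketched as a Pearson correlation computed on ranks, with tied values assigned their mean rank. The sketch below uses only the standard library; the function names and the example latitudes and allele frequencies are hypothetical, and no p-value is computed. In practice a statistics package (for example `scipy.stats.spearmanr`) would normally be used.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical group latitudes (degrees north) and allele frequencies.
latitude = [12.9, 17.4, 22.6, 26.8, 31.1]
allele_freq = [0.21, 0.25, 0.24, 0.31, 0.35]
print(round(spearman_rho(latitude, allele_freq), 3))
```

Because only ranks enter the calculation, the statistic detects any monotonic trend in allele frequency along latitude or longitude and is not limited to a linear cline.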

FST was higher among language family groups than state groups, except for SNPs NFE2L2 rs17647588, MC4R rs17782313, THADA rs7578597, and ESR1 rs985694. Higher FST values may be attributable to strong differentiation between Sino-Tibetan populations and other Indian populations. Grouping populations by language family confirms strong differences between Sino-Tibetan populations and other Indian populations at almost all loci (Tables S7 and S8). Austro-Asiatic speakers sometimes show intermediate allele frequencies between Sino-Tibetan and Indo-European speakers, in accordance with their geographic distribution and demographic history involving some gene flow from the southeast (Chaubey et al., 2011; Table 2; Fig. 2). Average FST values for each SNP vary widely, from a minimum of 0.002 at ESR1 rs985694 to a maximum of 0.277 at KCNQ1 rs2237892 (Table S8). Variance in FST values among linguistic groups was slightly higher than variance among global groups for the same set of 14 SNPs (global variance: 5.08 × 10⁻³; language group variance: 5.1 × 10⁻³). The high KCNQ1 FST value is attributable to the inclusion of the Sino-Tibetan populations, in which the risk allele frequency of 0.66 is identical to the risk allele frequency in the East Asian population group (Tables S6 and S8).

Population Exclusions and Patterns of Variation within Karnataka State

Variation among Indian states was highest when Sino-Tibetan populations were included, especially for CDKN2A/B, RAPGEF4, and KCNQ1 (Fig. 3). Grouping the 62 Indian populations by geographic region (e.g. State) did not always reveal large variation among populations (Fig. 3). On the other hand, substantial variation in allele frequencies among groups sampled at a fine-scale geographic level (e.g. within Karnataka only) suggests that these methods of grouping populations may be inadequate for accurately representing population variation (Raj et al., 2006, 2007).
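The sensitivity of FST to the inclusion of a single divergent group, as described above, can be illustrated with a leave-one-group-out calculation at one SNP. This is a hedged sketch only: the group labels and allele frequencies are invented, groups are weighted equally, and the simple (H_T − H_S)/H_T form is an assumption, not necessarily the estimator behind Figure 3.

```python
from typing import Dict


def fst_from_group_freqs(freqs: Dict[str, float]) -> float:
    """(H_T - H_S) / H_T for equally weighted groups at one biallelic SNP."""
    ps = list(freqs.values())
    p_bar = sum(ps) / len(ps)
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    h_s = sum(2.0 * p * (1.0 - p) for p in ps) / len(ps)
    return (h_t - h_s) / h_t if h_t > 0 else 0.0


def leave_one_out(freqs: Dict[str, float]) -> Dict[str, float]:
    """F_ST recomputed after dropping each group in turn."""
    return {
        dropped: fst_from_group_freqs(
            {name: p for name, p in freqs.items() if name != dropped}
        )
        for dropped in freqs
    }


# Invented frequencies: three similar groups and one divergent group.
group_freqs = {"A": 0.90, "B": 0.92, "C": 0.88, "outlier": 0.60}
full = fst_from_group_freqs(group_freqs)
print("all groups:", round(full, 4))
for dropped, value in leave_one_out(group_freqs).items():
    print("without", dropped, "->", round(value, 4))
```

In this toy example, dropping the divergent group lowers FST sharply, whereas dropping any one of the similar groups does not, which is the qualitative pattern one would expect when a few outlying populations drive most of the between-group variance.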

In all except two instances, FST differences among Indian populations increased upon removal of all Karnataka populations except the Gangadikaara Vokkaliga population, which was chosen to represent Karnataka in the state-level analyses because it is one of the largest populations in Karnataka. However, removing nearly all of the Karnataka populations had only a minor impact on state-wide FST values compared with removing only the Sino-Tibetan populations and keeping all Karnataka populations. Therefore, the Gangadikaara Vokkaliga population, as one of the most common populations in Karnataka, serves as a good representative of Karnataka population genetic variation, and grouping populations by state of origin may buffer against large fluctuations in allele frequency across populations.

FST estimates within Karnataka populations were highly variable; at SNPs rs10229583, rs12330015, rs17647588, rs6802898, and rs985694, FST among Karnataka populations was higher than in all other population groups, including FST among all Indian populations with Sino-Tibetan speakers included. At T2D-associated locus PPARG rs6802898, the large FST value for Karnataka populations may be attributed to the Havyak and Arya Vaisya rural and urban populations having a substantially lower risk allele frequency, on average 22 percentage points lower than other Karnataka populations. At loci such as rs10811661 and rs2237892, however, FST values among Karnataka populations were lower than among all other population groups, suggesting greater uniformity in allele frequency among populations within Karnataka at these loci (Fig. 3; Table S9). The high degree of variability observed at disease-associated loci at a fine geographic scale (e.g. within Karnataka populations only) suggests that studies designed to investigate T2D and obesity risk, and perhaps other complex diseases, in Indians must match cases and controls at a fine geographic scale.


We studied the distribution of allelic variation at T2D- and obesity-associated loci in India to: (1) test whether genetic variation at these loci, including loci identified as candidates of positive selection, mirrored the nation-wide distribution of obesity and T2D prevalence and (2) test whether measures of population differentiation varied among groups.

We found that T2D- and obesity-associated alleles that show geographic variation on a global scale show less pronounced or no geographic patterning in India, inconsistent with known geographic variation in T2D and obesity prevalence in India. The appearance of predominantly longitudinal, as opposed to latitudinal, correlations of allele frequencies in the global dataset and in the restricted Eurasian dataset, together with statistically significant Mantel correlations between FST and geographic distance, confirms established reports of a correlation between genetic and geographic distance at a broad geographic scale (Prugnolle et al., 2005; Ramachandran et al., 2005; Betti et al., 2009). Pairwise FST differences between north and south Indian population groups in our genome-wide dataset also showed values resembling the Reich et al. (2009) inbreeding-adjusted estimate, although our FST values calculated between north and south Indian population groups were not adjusted for inbreeding. There are several possible reasons for this discrepancy, including: (1) the SNPs on which FST differences are based represent nonrandom variation, (2) the Illumina population groups do not reflect all the geographic regions within India covered by the Reich et al. (2009) study, and (3) grouping several populations into north and south Indian population groups significantly impacts measures of FST.
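For readers unfamiliar with the Mantel correlations mentioned above, the following is a minimal sketch of a one-sided permutation Mantel test between a pairwise FST matrix and a geographic distance matrix. The function names, the number of permutations, the fixed random seed, and the toy matrices are all assumptions for illustration; the sketch does not reproduce the software or settings used in this study.

```python
import random
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def upper_triangle(m: Matrix, order: Sequence[int]) -> List[float]:
    """Off-diagonal upper-triangle entries after reordering rows and columns."""
    n = len(order)
    return [m[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient (0.0 if either input is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def mantel(gen: Matrix, geo: Matrix, permutations: int = 999,
           seed: int = 0) -> Tuple[float, float]:
    """Return (observed r, one-sided permutation p-value).

    Population labels of the genetic matrix are shuffled jointly over
    rows and columns, so the matrix structure is preserved.
    """
    identity = list(range(len(gen)))
    geo_vec = upper_triangle(geo, identity)
    observed = pearson(upper_triangle(gen, identity), geo_vec)
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)
        if pearson(upper_triangle(gen, perm), geo_vec) >= observed:
            hits += 1
    # The observed arrangement counts as one of the permutations.
    return observed, (hits + 1) / (permutations + 1)


# Toy example: four populations on a line, F_ST proportional to distance.
positions = [0.0, 1.0, 3.0, 7.0]
geo = [[abs(a - b) for b in positions] for a in positions]
gen = [[0.01 * d for d in row] for row in geo]
r, p = mantel(gen, geo)
print(round(r, 3), p)
```

Permuting population labels, not individual matrix cells, is what distinguishes the Mantel test from an ordinary correlation test: pairwise distances sharing a population are not independent, and the label permutation respects that dependence.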

Comparisons of allele frequencies across India and within a single state in India suggest that for some variants, differences within and among populations may be the same or greater within a single state than across India, and the degree of variation may depend on population sampling and grouping schemes. Across almost all alleles, inclusion of the Sino-Tibetan speaking populations inflated estimates of variation (as measured by FST and AMOVA). Sino-Tibetan speaking populations are known to share closer ancestry with East Asian populations than with South Asians, which may explain this result. Excluding Sino-Tibetan populations from the analyses, however, did not drastically reduce variation at the loci. We employed the same strategy for Illumina samples, in which only one or two individuals were sampled from a single, endogamous population. The grouping scheme may not have provided accurate results, however, as genome-wide estimates of FST fell two orders of magnitude below published estimates of Indian FST values (Table S5). Alternatively, the randomly chosen alleles used to estimate genome-wide FST in the Illumina samples may not truly represent neutral variation.

Whether the SNPs investigated in this dataset represent neutral variation, disease-associated variation, or variants under selection in Indian populations may also influence patterns of genetic variation. The lack of correlation between allelic variation and T2D and obesity prevalence trends suggests that these trends are influenced either more by environmental than by genetic factors, or by other SNPs that are yet to be determined. Association studies in Indian populations may suggest other variants that better explain T2D and obesity in Indian populations. Furthermore, most of the disease-associated alleles examined here also do not follow previously published patterns of neutral variation in India, referred to here as the “Indian cline,” following a gradient in allele frequency variation from Europe to India (Reich et al., 2009). These results may not be entirely surprising, as not all neutral or disease-associated SNPs are expected to follow the same geographic pattern. SNPs in the obesity-associated NRXN3 and T2D-associated KCNQ1 genes, however, somewhat follow the “Indian cline” (Table 3), although overall, less geographic variation was observed within India than across global populations. As already mentioned, these patterns could be due to either strong effects of selection or drift at individual loci; note, however, that neither NRXN3 nor KCNQ1, both of which followed the geographic trend expected from genome-wide average data, appeared to be under selection based on the local partial sweep iHS statistic. However, founder effects and genetic drift may be more pronounced in Indian populations than in other populations because many of them have a characteristically small size and high levels of endogamy (Reich et al., 2009).

We did not find any significant clinal patterns with PPARG variant rs6802898, chosen for genotyping because of its high iHS score ranking in the Indian populations. Unlike other variants that were selected as T2D candidates from Gaulton et al. (2008), rs6802898 is an intronic SNP in the PPARG gene, in which the Pro12Ala variant has been previously reported to be associated with T2D and obesity in Indian populations (Sanghera et al., 2010; Vimaleswaran et al., 2010; Prakash et al., 2012). The variant genotyped here was not previously reported to be associated with T2D. While some variants associated with T2D and obesity are shared between European and Indian populations, a number of variants are associated only in Indian populations. Recent studies have identified new loci associated with T2D in South Asians that were not previously found to be associated with T2D in other populations (Vimaleswaran et al., 2010; Kooner et al., 2011; Tabassum et al., 2012). It remains to be tested whether these newly discovered variants correlate with the north to south geographic patterns in India.

The sampling strategy used here comprehensively represented Indian populations on the north-south axis, but the east-west axis was less well covered. Future studies may benefit from increased genetic information on Indian populations, additional studies to identify markers that specifically influence diabetes and obesity in Indians, and wider geographic sampling to gain a more complete understanding of the relationship between genetic and geographic variation.


We would like to thank all the participants for providing saliva samples for the DNA analysis, and over 80 individuals and organizations that helped in the process. In particular, the authors would like to acknowledge Mr. and Mrs. H. B. Rajagopal, Mahadeva, Mrs. Poornima Rangappa and Mr. Girijashankar for their help in coordinating sample collection. Maggie Bellatti, Krishnendu Khan, Jasbeer Singh, Charles Spurgeon, and Kranthi Kumar provided support in the laboratory. Drs Gabriel Amable and Paco Bertolani assisted in the generation of the interpolated maps. Finally, funding for this work came from the UK-India Education and Research Initiative, Gates Cambridge Trust, Centre for Human Genetics and Indian Institute of Science (Bangalore, India), the Bridget's Trust, Gonville and Caius College, the Cambridge-India Partnership Fund, as well as CardioMed-BSC0122 of Council of Scientific and Industrial Research (CSIR), Government of India.