Evolutionary history of disease‐susceptibility loci identified in longitudinal exome‐wide association studies
Abstract
Background
Our longitudinal exome‐wide association studies previously detected various genetic determinants of complex disorders using ~26,000 single‐nucleotide polymorphisms (SNPs) that passed quality control and longitudinal medical examination data (mean follow‐up period, 5 years) in 4884–6022 Japanese subjects. We found that allele frequencies of several identified SNPs were remarkably different among four ethnic groups. Elucidating the evolutionary history of disease‐susceptibility loci may help us uncover the pathogenesis of the related complex disorders.
Methods
In the present study, we conducted evolutionary analyses such as extended haplotype homozygosity, focusing on genomic regions containing disease‐susceptibility loci and based on genotyping data of our previous studies and datasets from the 1000 Genomes Project.
Results
Our evolutionary analyses suggest that derived alleles of rs78338345 of GGA3, rs7656604 at 4q13.3, rs34902660 of SLC17A3, and six SNPs closely located at 12q24.1 associated with type 2 diabetes mellitus, obesity, dyslipidemia, and three complex disorders (hypertension, hyperuricemia, and dyslipidemia), respectively, rapidly expanded after the human dispersion from Africa (Out‐of‐Africa). Allele frequencies of GGA3 and six SNPs at 12q24.1 appeared to have remarkably changed in East Asians, whereas the derived alleles of rs34902660 of SLC17A3 and rs7656604 at 4q13.3 might have spread across Japanese and non‐Africans, respectively, although we cannot completely exclude the possibility that allele frequencies of disease‐associated loci may be affected by demographic events.
Conclusion
Our findings indicate that derived allele frequencies of nine disease‐associated SNPs (rs78338345 of GGA3, rs7656604 at 4q13.3, rs34902660 of SLC17A3, and six SNPs at 12q24.1) identified in the longitudinal exome‐wide association studies largely increased in non‐Africans after Out‐of‐Africa.
1 INTRODUCTION
Recent genome‐wide association studies (GWASs) have identified various genetic variants that confer susceptibility to hypertension (International Consortium for Blood Pressure Genome‐Wide Association Studies et al., 2011; Kato et al., 2011; Levy et al., 2009; Liu et al., 2016; Newton‐Cheh et al., 2009; Simino et al., 2014), chronic kidney disease ([CKD]; Böger et al., 2011; Chambers et al., 2010; Okada, Sim, et al., 2012; Pattaro et al., 2016), hyperuricemia (Chittoor et al., 2016; Köttgen et al., 2013; Matsuo et al., 2008; Nakayama et al., 2017; Phipps‐Green et al., 2016), type 2 diabetes mellitus ([T2DM]; Cho et al., 2012; Hara et al., 2014; Imamura & Maeda, 2011; Imamura et al., 2016; Mahajan et al., 2014; Matsuba et al., 2016; Morris et al., 2012), obesity (Dorajoo et al., 2017; Fall & Ingelsson, 2014; Locke et al., 2015; Okada, Kubo, et al., 2012; Scott et al., 2016), metabolic syndrome ([MetS]; Fall & Ingelsson, 2014; Kraja et al., 2011; Povel, Boer, Reiling, & Feskens, 2011; Zabaneh & Balding, 2010), dyslipidemia [hypertriglyceridemia, hyper‐low density lipoprotein (LDL)‐cholesterolemia, and hypo‐high density lipoprotein (HDL)‐cholesterolemia] (Aguilar‐Salinas, Tusie‐Luna, & Pajukanta, 2014; Asselbergs et al., 2012; Barber et al., 2010; Jia et al., 2016; Kathiresan et al., 2009; Kooner et al., 2008; Kurano et al., 2016; Lange et al., 2014; Teslovich et al., 2010), coronary artery disease ([CAD]; Golbus et al., 2016; Lu et al., 2012; Nikpay et al., 2015; Schunkert et al., 2011; Wakil et al., 2016), and cerebral infarction (CI; Akinyemi et al., 2015; Kubo et al., 2007; Meschia et al., 2011; Yamada et al., 2009) in diverse ethnic groups. In addition, genetic variants associated with hematological traits have been identified by previous GWASs (Kamatani et al., 2010; Li et al., 2013; Lo et al., 2011; Mousas et al., 2017; Okada & Kamatani, 2012; van Rooij et al., 2017). 
Information on the association of numerous genetic variants with various disease‐related traits is publicly available in several databases such as Genome‐Wide Repository of Associations Between SNPs and Phenotypes (GRASP) (https://grasp.nhlbi.nih.gov/Overview.aspx; Leslie, O'Donnell, & Johnson, 2014), GWAS Catalogue (https://www.ebi.ac.uk/; MacArthur et al., 2017), and DisGeNET (http://www.disgenet.org/web/DisGeNET/; Piñero et al., 2015). Most conventional GWASs, however, have been conducted in a cross‐sectional manner that measured traits at a single point in time.
In our previous studies, therefore, longitudinal exome‐wide association studies were performed to explore novel genetic determinants of complex disorders (hypertension, T2DM, dyslipidemia, CAD, CI, CKD, hyperuricemia, obesity, and MetS) as well as 13 hematological traits in 4884–6022 Japanese subjects who had undergone annual health checkups for several years (Yasukochi et al., 2017, 2018a, 2018b, 2018c, 2018d, 2018e, 2019). In these studies, we identified 272 single‐nucleotide polymorphisms (SNPs) associated with complex disorders or hematological markers, using ~26,000 SNPs that passed quality control and longitudinal medical examination data of the Japanese subjects. Of these SNPs, 28 (one SNP for hypertension, two for T2DM, two for hyper‐LDL‐cholesterolemia, four for hypo‐HDL‐cholesterolemia, three for CAD, three for CI, one for hyperuricemia, three for obesity, and nine for hematological traits) were identified as novel genetic determinants (Table 1). Whole‐genome sequence analysis helps to account for the missing heritability of complex traits and diseases (Wainschtein et al., 2019) because it covers coding and noncoding variants, including rare variants in regions of low linkage disequilibrium (LD). The generalized estimating equation method implemented in our previous studies can inflate the type I error rate when the effective sample size is small (Sitlani et al., 2015). Therefore, during quality control of the genotyping data in the longitudinal exome‐wide association studies, SNPs with a small effective sample size (minor allele frequency [MAF] of <5%) were filtered out. Nevertheless, our longitudinal exome‐wide association studies identified 28 SNPs that were not found in previous GWASs (Yasukochi et al., 2017, 2018a, 2018b, 2018c, 2018d, 2018e, 2019). This suggests that the longitudinal exome‐wide association study is useful for finding the missing heritability of complex traits and diseases even though rare variants are removed.
| Disease or Trait | RefSNP ID | Allele^a | Position^b | Position^c | Gene or locus | Amino acid change^d | Reference |
|---|---|---|---|---|---|---|---|
| Hypertension | rs11917356 | G → A | 3:130110550 | 3:130391707 | COL6A5 | G982D | Yasukochi et al., 2017 |
| Hyperuricemia | rs55975541 | G → A | 11:64597201 | 11:6482929 | CDC42BPG | R1237W | Yasukochi et al., 2018a |
| Hematological traits | rs3917688 | G → A | 1:169591080 | 1:169621842 | SELP | | Yasukochi et al., 2018b |
| | rs7584099 | G → A | 2:148478336 | 2:147720767 | 2q22.3 | | |
| | rs4686683 | C → A | 3:185307363 | 3:185589575 | SENP2 | | |
| | rs13121954 | G → A | 4:148023829 | 4:147102677 | 4q31.2 | | |
| | rs395967 | A → G | 5:38842959 | 5:38842857 | OSMR‐AS1 | | |
| | rs1579219 | G → A | 6:30224305 | 6:30256528 | HCG17 | | |
| | rs12338 | G → C | 8:11710888 | 8:11853379 | CTSB | L26V | |
| | rs3133745 | C → T | 8:96534806 | 8:95522578 | C8orf37‐AS1 | | |
| | rs10757049 | A → G | 9:19281501 | 9:19281503 | DENND4C | | |
| Dyslipidemia | rs34902660 | C → A | 6:25851102 | 6:25850874 | SLC17A3 | G239V | Yasukochi et al., 2018c |
| | rs1042127 | T → G | 6:31084170 | 6:31116393 | CDSN | S408A | |
| | rs74416240 | A → G | 12:110342598 | 12:109904793 | GIT2 | N387S | |
| | rs925368 | G → A | 12:110390979 | 12:109953174 | TCHP | E152E | |
| | rs7969300 | T → C | 12:111993712 | 12:111555908 | ATXN2 | N248S | |
| | rs12231744 | C → T | 12:112477055 | 12:112039251 | NAA25 | R876K | |
| Obesity | rs9491140 | C → T | 6:124691237 | 6:124370091 | NKAIN2 | | Yasukochi et al., 2018d |
| | rs145848316 | C → A | 7:151882672 | 7:152185587 | KMT2C | A1685S | |
| | rs7863248 | T → C | 9:88308127 | 9:85693212 | AGTPBP1 | | |
| CAD/CI | rs4606855 | G → C | 19:14769339 | 19:14658527 | ADGRE3 | E75Q | Yasukochi et al., 2018e |
| | rs7132908 | G → A | 12:50263148 | 12:49869365 | FAIM2 | | |
| | rs6580741 | G → C | 12:50727706 | 12:50333923 | FAM186A | H2228Q | |
| | rs1324015 | G → A | 13:43727849 | 13:43153713 | LINC00400 | | |
| | rs884205 | G → T | 18:60054857 | 18:62387624 | TNFRSF11A | | |
| | rs3746414 | G → A | 20:50769379 | 20:52152840 | ZFP64 | S451N | |
| T2DM | rs6414624 | C → T | 4:5743512 | 4:5741785 | EVC | H258Y | Yasukochi et al., 2019 |
| | rs78338345 | C → G | 17:73238509 | 17:75242428 | GGA3 | E147Q | |
- Abbreviations: CAD, coronary artery disease; CI, cerebral infarction; T2DM, type 2 diabetes mellitus.
- a Major allele → minor allele.
- b Position in NCBI build GRCh37.p13.
- c Position in NCBI build GRCh38.p10.
- d Splice variants of amino acid substitution are not shown.
Of the 28 SNPs identified in our longitudinal exome‐wide association studies, allele frequencies of some disease‐susceptibility loci are clearly different between Africans (AFRs) and non‐AFRs, according to allele frequency data of four ethnic populations from the 1000 Genomes Project (1KGP; The 1000 Genomes Project Consortium, 2010). This suggests that derived or ancestral allele frequencies of disease‐susceptibility loci may have increased after the divergence of non‐AFR and AFR ancestral populations. Approximately 135,000–40,000 years ago, anatomically modern humans dispersed throughout the world from Africa (Gronau, Hubisz, Gulko, Danko, & Siepel, 2011; Groucutt et al., 2015; Li & Durbin, 2011; López, van Dorp, & Hellenthal, 2015; Olivieri et al., 2006; Reyes‐Centeno, Hubbe, Hanihara, Stringer, & Harvati, 2015; Soares et al., 2009; Wall, 2017), although the dating remains controversial because of a recent finding of older anatomically modern human fossils (194,000–177,000 years ago) outside Africa (Hershkovitz et al., 2018). These migrations are called “Out‐of‐Africa”. After the exodus from Africa, positive selection has acted on several genomic regions in non‐AFRs to adapt to new environments (Bigham et al., 2010; Crawford et al., 2017; Pickrell et al., 2009; Sabeti et al., 2007; Simonson, Huff, Witherspoon, Prchal, & Jorde, 2015; Voight, Kudaravalli, Wen, & Pritchard, 2006). However, it is possible that positively selected variants might increase susceptibility to complex disorders in return for environmental adaptability (Okumiya et al., 2010). In the present study, we have estimated the evolutionary history of genomic regions around disease‐associated loci identified in our previous studies. These evolutionary inferences may be helpful in elucidating how and why allelic variants conferring susceptibility to complex disorders have been disseminated in modern humans.
2 MATERIALS AND METHODS
2.1 Ethics statement
The study protocol complies with the Declaration of Helsinki and was approved by the Committees on the Ethics of Human Research of Mie University Graduate School of Medicine and Inabe General Hospital. Written informed consent was obtained from all subjects prior to enrollment in the present study.
2.2 Datasets
In previous longitudinal exome‐wide association studies (Yasukochi et al., 2017, 2018a, 2018b, 2018c, 2018d, 2018e, 2019), 6026 community‐dwelling individuals were recruited from population‐based cohort studies (Inabe Health and Longevity Study) in Inabe City, Mie, Japan (Oguri et al., 2017; Yamada, Matsui, Takeuchi, & Fujimaki, 2015a, 2015b; Yamada, Matsui, Takeuchi, Oguri, & Fujimaki, 2015a, 2015b). These individuals visited Inabe General Hospital for an annual health checkup, with a mean follow‐up period of 5 ± 3 years (covering April 2003–March 2014). We refer to this cohort as “Inabe cohort”, and used SNP data genotyped by Infinium HumanExome‐12 ver. 1.2 BeadChip and Infinium Exome‐24 ver. 1.0 BeadChip (Illumina) (Grove et al., 2013) in our previous studies to determine their allele frequencies. The frequencies in the Inabe cohort are listed in Table 2 and Table S1.
| Disease | RefSNP ID (gene) | Position^a | EAS: All | EAS: JP‐Inabe | EAS: JPT^b | EAS: CDX^b | EAS: CHB^b | EAS: CHS^b | EAS: KHV^b | SAS^b | EUR^b | AFR^b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hypertension‐Hyperuricemia‐Dyslipidemia | rs12229654 | 12:111414461 | T: 0.777 (13,319) | T: 0.747 (9,001) | T: 0.837 (174) | T: 0.957 (178) | T: 0.835 (172) | T: 0.724 (152) | T: 0.869 (172) | T: 1.000 (978) | T: 1.000 (1,006) | T: 1.000 (1,322) |
| | | | G: 0.223 (3,829) | G: 0.253 (3,043) | G: 0.163 (34) | G: 0.043 (8) | G: 0.165 (34) | G: 0.276 (58) | G: 0.131 (26) | | | |
| | rs3782886 | 12:112110489 | A: 0.719 (12,323) | A: 0.682 (8,220) | A: 0.750 (156) | A: 0.957 (178) | A: 0.845 (174) | A: 0.729 (153) | A: 0.864 (171) | A: 1.000 (978) | A: 1.000 (1,006) | A: 1.000 (1,322) |
| | BRAP | | G: 0.281 (4,825) | G: 0.318 (3,824) | G: 0.250 (52) | G: 0.043 (8) | G: 0.155 (32) | G: 0.271 (57) | G: 0.136 (27) | | | |
| | rs11066015 | 12:112168009 | G: 0.735 (12,582) | G: 0.700 (8,425) | G: 0.864 (171) | G: 0.957 (178) | G: 0.840 (173) | G: 0.719 (151) | G: 0.760 (158) | G: 1.000 (978) | G: 1.000 (1,006) | G: 1.000 (1,322) |
| | ACAD10 | | A: 0.265 (4,548) | A: 0.300 (3,619) | A: 0.136 (27) | A: 0.043 (8) | A: 0.160 (33) | A: 0.281 (59) | A: 0.240 (50) | | | |
| | rs671 | 12:112241766 | G: 0.714 (12,353) | G: 0.699 (8,422) | G: 0.760 (158) | G: 0.957 (178) | G: 0.840 (173) | G: 0.729 (153) | G: 0.864 (171) | G: 1.000 (978) | G: 1.000 (1,006) | G: 0.998 (1,320) |
| | ALDH2 | | A: 0.286 (4,960) | A: 0.301 (3,622) | A: 0.240 (50) | A: 0.043 (8) | A: 0.160 (33) | A: 0.271 (57) | A: 0.136 (27) | | | A: 0.002 (2) |
| | rs2074356 | 12:112645401 | C: 0.757 (12,991) | C: 0.722 (8,696) | C: 0.788 (164) | C: 0.968 (180) | C: 0.898 (185) | C: 0.819 (172) | C: 0.894 (177) | C: 1.000 (978) | C: 1.000 (1,006) | C: 1.000 (1,322) |
| | HECTD4 | | T: 0.243 (4,164) | T: 0.278 (3,348) | T: 0.212 (44) | T: 0.032 (6) | T: 0.102 (21) | T: 0.181 (38) | T: 0.106 (21) | | | |
| | rs11066280 | 12:112817783 | T: 0.719 (12,326) | T: 0.687 (8,272) | T: 0.764 (159) | T: 0.866 (161) | T: 0.811 (167) | T: 0.705 (148) | T: 0.783 (155) | T: 0.996 (974) | T: 1.000 (1,006) | T: 0.999 (1,321) |
| | HECTD4 | | A: 0.281 (4,824) | A: 0.313 (3,772) | A: 0.236 (49) | A: 0.134 (25) | A: 0.189 (39) | A: 0.295 (62) | A: 0.217 (43) | A: 0.004 (4) | | A: 0.001 (1) |
| T2DM | rs78338345 | 17:73238509 | C: 0.888 (11,586) | C: 0.886 (10,670) | C: 0.861 (179) | C: 0.973 (181) | C: 0.869 (179) | C: 0.914 (192) | C: 0.934 (185) | C: 0.998 (976) | C: 1.000 (1,006) | C: 1.000 (1,322) |
| | GGA3 | | G: 0.112 (1,464) | G: 0.114 (1,372) | G: 0.139 (29) | G: 0.027 (5) | G: 0.131 (27) | G: 0.086 (18) | G: 0.066 (13) | G: 0.002 (2) | | |
| Obesity | rs7656604 | 4:72547436 | G: 0.921 (11,997) | G: 0.916 (11,031) | G: 0.923 (192) | G: 0.989 (184) | G: 0.971 (200) | G: 0.957 (201) | G: 0.955 (189) | G: 0.947 (926) | G: 0.994 (1,000) | G: 0.200 (264) |
| | | | A: 0.079 (1,029) | A: 0.084 (1,013) | A: 0.077 (16) | A: 0.011 (2) | A: 0.029 (6) | A: 0.043 (9) | A: 0.045 (9) | A: 0.053 (52) | A: 0.006 (6) | A: 0.800 (1,058) |
| Dyslipidemia | rs34902660 | 6:25851102 | C: 0.907 (11,831) | C: 0.900 (10,836) | C: 0.938 (195) | C: 1.000 (186) | C: 1.000 (206) | C: 1.000 (210) | C: 1.000 (198) | C: 1.000 (978) | C: 1.000 (1,006) | C: 0.984 (1,301) |
| | SLC17A3 | | A: 0.093 (1,219) | A: 0.100 (1,206) | A: 0.062 (13) | | | | | | | A: 0.016 (21) |
Note
- Values indicate the allele frequency, with the observed numbers in parentheses. The upper and lower alleles are the major and minor alleles in the Inabe cohort, respectively. JP‐Inabe is Japanese in the Inabe cohort; JPT is Japanese in Tokyo, Japan; CDX is Chinese Dai in Xishuangbanna, China; CHB is Han Chinese in Beijing, China; CHS is Southern Han Chinese; KHV is Kinh in Ho Chi Minh City, Vietnam. T2DM, type 2 diabetes mellitus. AFR, Africans; EAS, East Asians; SAS, South Asians; EUR, Europeans.
- a Position in NCBI build GRCh37.p13.
- b Allele frequency obtained from the 1000 Genomes Project through the Ensembl genome browser.
We conducted evolutionary analyses using datasets retrieved from public databases. Information regarding allele frequencies of the target SNPs within four ethnic groups [East Asian (EAS), South Asian (SAS), European (EUR), and AFR] was obtained from the 1000 Genomes Project (http://www.internationalgenome.org/; The 1000 Genomes Project Consortium, 2010) using the Ensembl genome browser (http://www.ensembl.org; Zerbino et al., 2018). The allele frequencies in the four ethnic groups are listed in Table 2 and Table S1. The categories of the ethnic populations are listed at the following URL: http://www.internationalgenome.org/data-portal/population. We also obtained allele frequency data of target SNPs using The Exome Aggregation Consortium (ExAC) Browser (http://exac.broadinstitute.org/). Information on the allele frequencies in great apes (chimpanzee, gorilla, and orangutan) was obtained from the Great Ape Genome Project (GAGP) database (http://biologiaevolutiva.org/greatape/; Prado‐Martinez et al., 2013). Alleles with high frequency in AFRs of present‐day humans, archaic humans, and chimpanzees were defined as ancestral alleles.
The nucleotides at sites homologous to the target SNPs in vertebrate reference genomes were examined using the Multiz Alignments of 100 Vertebrates (Blanchette et al., 2004) and the PhyloP score (Pollard, Hubisz, Rosenbloom, & Siepel, 2010) in the University of California, Santa Cruz (UCSC) Genome Browser database (http://genome.ucsc.edu; Kent et al., 2002). The effect of each amino acid substitution on protein function was predicted using SIFT (Ng & Henikoff, 2003; Vaser, Adusumalli, Leng, Sikic, & Ng, 2016), PolyPhen‐2 (Adzhubei et al., 2010), and Combined Annotation Dependent Depletion ([CADD]; Rentzsch, Witten, Cooper, Shendure, & Kircher, 2019) scores.
2.3 Estimates of LD
Genotype data used in the present study were formatted for the programs described below using R software version 3.5.1 (R Foundation for Statistical Computing; R Core Team, 2018) through RStudio 1.1.456 (RStudio, Boston, MA; http://www.rstudio.com/) (RStudio Team, 2016) and Perl script (version 5.26.2; https://www.perl.org/get.html).
We conducted the LD analysis for a candidate SNP whose allele frequency may have largely changed after Out‐of‐Africa. The LD among SNPs was estimated using JMP Genomics version 9.0 software (SAS Institute). In addition, we surveyed the LD between a target SNP and adjacent SNPs in JPT (Japanese in Tokyo, Japan) from the 1KGP, employing LDlink web‐based tools (https://analysistools.nci.nih.gov/LDlink/; Machiela & Chanock, 2015).
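The r² measure of LD reported in the Results can be illustrated with a minimal sketch. It assumes phased haplotypes coded as 0/1 at two biallelic SNPs. The function name `ld_r2` and the toy vectors are hypothetical and are not taken from JMP Genomics or LDlink, which were the tools used in this study.

```python
def ld_r2(hap_a, hap_b):
    """Return r^2 between two biallelic SNPs from phased haplotypes.

    hap_a, hap_b: equal-length lists of 0/1 alleles, one entry per haplotype.
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    p_a = sum(hap_a) / n                      # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n                      # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")                   # undefined if either SNP is monomorphic
    return d * d / denom


# Toy haplotypes (illustrative only, not study data)
print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]))
print(ld_r2([0, 1, 0, 1], [0, 0, 1, 1]))
```

Under this definition, SNP pairs with r² > 0.8 would be treated as being in strong LD, in line with the threshold used for the LDproxy results.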
2.4 Inference of evolutionary events
A sliding window analysis (window size = 100 SNPs and step size = 20 SNPs) was implemented to examine the nucleotide diversity (π) among phased haplotypes from the 1KGP (note that the sequence length of haplotypes is not considered). Tajima's D statistic (Tajima, 1989) was used to test the significance of departure from neutral expectations after removing the estimated possible recombination sites. These analyses were performed using DnaSP 5.10.01 software (Librado & Rozas, 2009). The extended haplotype homozygosity (EHH; Sabeti et al., 2002) and integrated haplotype score (iHS; Voight et al., 2006) were used to infer recent positive selection using the R package “rehh 2.0” (Gautier, Klassmann, & Vitalis, 2017; Gautier & Vitalis, 2012). A signal of positive selection in ancestral modern human populations was surveyed by the Selective Sweep Scan on human versus Neanderthal polymorphisms (Green et al., 2010) in the UCSC Genome Browser.
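The sliding window scheme described above (100 SNPs per window, step of 20 SNPs) can be sketched as follows. This is only an illustration, not the DnaSP implementation: it assumes biallelic SNPs coded 0/1 and defines π per window as the mean pairwise difference per SNP site, ignoring physical sequence length. The names `site_diversity` and `sliding_pi` are hypothetical.

```python
def site_diversity(alleles):
    """Expected pairwise difference at one biallelic site (0/1 alleles)."""
    n = len(alleles)
    k = sum(alleles)                          # count of allele 1
    return 2.0 * k * (n - k) / (n * (n - 1))


def sliding_pi(haplotypes, window=100, step=20):
    """Return (start_index, pi) for each window of SNPs.

    haplotypes: list of equal-length 0/1 lists, one per phased haplotype.
    pi is averaged over the SNP sites in the window (assumption: the
    physical length of the window is not taken into account).
    """
    n_sites = len(haplotypes[0])
    per_site = [site_diversity([h[j] for h in haplotypes]) for j in range(n_sites)]
    results = []
    for start in range(0, n_sites - window + 1, step):
        results.append((start, sum(per_site[start:start + window]) / window))
    return results


# Toy data: 4 haplotypes x 2 SNPs (illustrative only)
toy = [[0, 1], [1, 1], [0, 0], [1, 0]]
print(sliding_pi(toy, window=2, step=1))
```

Windows overlapping a region of reduced diversity would show lower π values, which is the pattern compared between AFR and non‐AFR groups in the Results.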
3 RESULTS
To explore SNPs whose allele frequencies have remarkably changed after Out‐of‐Africa, we used information available on the 1KGP database and our genotype data for the Inabe cohort. Because densely located SNPs may be in strong LD, we first examined the allele frequencies of closely located SNPs among the 272 disease‐associated SNPs identified in our previous studies. As a result, 52 SNPs were clustered in six genomic regions: more than five SNPs were located within 100 kb genomic intervals (Table S1). In addition, we also surveyed allele frequencies of 28 SNPs that have been newly identified in our longitudinal exome‐wide association studies (Table 1). Of the 28 SNPs, 17 were removed from further analyses because their allele frequencies were not largely different among the four ethnic groups, and three were included in the SNP clusters mentioned above. A total of 60 SNPs in 10 genomic regions were used for further analyses. Genome scans to detect positive selection were comprehensively conducted throughout the 10 genomic regions around disease‐susceptibility loci, using genotype data from the 1KGP.
After candidate SNPs were filtered out by the EHH analyses (Figures S1–S5), potential signatures of positive selection were detected in four genomic regions around nine genetic determinants of T2DM [rs6414624 (C → T, p.H258Y) of EvC ciliary complex subunit 1 (EVC) and rs78338345 (C → G, p.E147Q) of golgi associated, gamma adaptin ear containing, ARF binding protein 3 (GGA3) (Yasukochi et al., 2019)], obesity [rs7656604 (G → A) at 4q13.3 (Yasukochi et al., 2018d)], dyslipidemia [rs34902660 of solute carrier family 17 member 3 (SLC17A3) and rs1042127 of corneodesmosin (CDSN) (Yasukochi et al., 2018c)], or hypertension‐dyslipidemia‐hyperuricemia [rs12229654 (T → G) at 12q24.11, rs3782886 (T → C, p.R241R) of BRCA1 associated protein (BRAP), rs11066015 (G → A) of acyl‐CoA dehydrogenase family member 10 (ACAD10), rs671 (G → A, p.E504K) of aldehyde dehydrogenase 2 family member (ALDH2), and rs2074356 (G → A) and rs11066280 (T → A) of HECT domain E3 ubiquitin protein ligase 4 (HECTD4) (Yasukochi et al., 2017, 2018a, 2018c)].
3.1 Susceptibility loci for T2DM
According to information on the 1KGP database (Table 2), the frequency of the minor allele “G” of rs78338345 in GGA3 in the Inabe cohort was remarkably low outside of East Asia (EAS = 0.079, SAS = 0.002, and EUR and AFR = 0.000). Based on information available on the ExAC Browser, the MAF was also extremely low in the three non‐EAS populations (0.0002–0.0012) while it was relatively high in EAS (0.1016). Analyses implemented in our previous study suggested that the MAF of this SNP has increased across East Asians in recent evolutionary time, although a strong signal of recent positive selection was not detected (Yasukochi et al., 2019).
The LDproxy search in the LDlink web‐based application indicated that rs78338345 of GGA3 and rs28372681 near nucleoporin 85 (NUP85) were in significant LD (r² = 0.882) in JPT (Table S2). However, rs28372681 has not been reported to be associated with T2DM. The LDproxy also showed no SNP in LD with rs6414624 of EVC (r² < 0.2) in JPT (Table S3). These results suggest that the two SNPs identified in our longitudinal exome‐wide association studies affect the prevalence of T2DM independently.
We conducted sliding window analyses for nucleotide diversities in a 24.6 kb genomic region around GGA3 in the four ethnic groups from the 1KGP (Figure 1). The π values in non‐AFRs were lower than those in AFRs throughout the region. In particular, π values of five windows in a 5.2 kb region containing rs78338345 in non‐AFRs (π = 0.002–0.015) were significantly (p = 5.4 × 10⁻⁷, Welch's t test) lower than those in AFRs (π = 0.027–0.034). Tajima's D test for the 5.2 kb region indicated significant (p < .05) negative D values in all four ethnic groups (AFR = −2.33, EAS = −2.04, EUR = −2.52, SAS = −2.49). A negative D value can result from a selective sweep (recent positive selection), purifying selection, or recent population growth.
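To make the sign of Tajima's D easier to interpret, the statistic can be sketched from its standard definition (Tajima, 1989): the difference between the mean number of pairwise differences and S/a1, scaled by its estimated standard deviation. The sketch below assumes biallelic 0/1 haplotypes, and the function name `tajimas_d` is hypothetical. It does not reproduce DnaSP's handling of recombination sites or its significance test.

```python
import math


def tajimas_d(haplotypes):
    """Tajima's D from phased 0/1 haplotypes (list of equal-length lists)."""
    n = len(haplotypes)
    if n < 4:
        raise ValueError("at least four haplotypes are needed")
    n_sites = len(haplotypes[0])
    k_hat = 0.0   # mean number of pairwise differences, summed over sites
    seg = 0       # number of segregating sites S
    for j in range(n_sites):
        k = sum(h[j] for h in haplotypes)
        if 0 < k < n:
            seg += 1
            k_hat += 2.0 * k * (n - k) / (n * (n - 1))
    if seg == 0:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    return (k_hat - seg / a1) / math.sqrt(e1 * seg + e2 * seg * (seg - 1))


# Toy data (illustrative only): an excess of singletons gives D < 0
singletons = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
print(tajimas_d(singletons))
```

An excess of rare variants, as expected after a selective sweep or population growth, pushes D below zero, whereas an excess of intermediate‐frequency variants pushes it above zero.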

We additionally performed EHH analyses in a 2.5 Mb genomic region around rs78338345, using genotype data from the 1KGP. The analyses indicated that EHH values of the derived allele “G” were considerably higher than those of the ancestral allele “C” throughout the region examined in EAS whereas high EHH values of the derived allele were not observed in SAS (Figure 2), suggesting that GGA3 can be a candidate target of positive selection in EAS.
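The EHH statistic compared between derived and ancestral alleles above can be sketched from its definition (Sabeti et al., 2002): among haplotypes carrying a given core allele, it is the probability that two randomly chosen haplotypes are identical from the core SNP out to a given marker. The code below is an illustration under that definition, with a hypothetical function name `ehh` and toy data. It is not the rehh implementation.

```python
from collections import Counter


def ehh(haplotypes, core, allele, upto):
    """EHH for haplotypes carrying `allele` at index `core`, out to index `upto`.

    haplotypes: list of equal-length 0/1 lists (phased).
    Returns the fraction of carrier pairs that are identical over the
    interval between `core` and `upto` (inclusive).
    """
    carriers = [h for h in haplotypes if h[core] == allele]
    n = len(carriers)
    if n < 2:
        return float("nan")
    lo, hi = min(core, upto), max(core, upto)
    groups = Counter(tuple(h[lo:hi + 1]) for h in carriers)
    identical_pairs = sum(c * (c - 1) for c in groups.values())
    return identical_pairs / (n * (n - 1))


# Toy haplotypes (illustrative only); core SNP at index 0
toy = [[1, 0, 0], [1, 0, 0], [1, 0, 1], [0, 1, 1], [0, 0, 1]]
for marker in range(3):
    print(marker, ehh(toy, core=0, allele=1, upto=marker), ehh(toy, core=0, allele=0, upto=marker))
```

A derived allele whose EHH stays high over a long distance while the ancestral allele's EHH decays quickly is the pattern interpreted as a candidate signal of recent positive selection.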

Using variant call format datasets of 38 chimpanzees, Pan troglodytes and Pan paniscus, from the GAGP database, we examined genotypes in the site homologous to human rs78338345. No chimpanzee examined had the minor allele “G”. In addition, the major allele “C” might be evolutionarily conserved in vertebrates including archaic humans, according to information on the UCSC Genome Browser database (Table S4). These results suggest that the “G” allele may be unique to modern humans, although further research for genetic polymorphisms in each vertebrate species is required.
According to information on the 1KGP database, the frequency of the allele “C” of rs6414624 in EVC, which is the major (ancestral) allele in the Inabe cohort, was high outside Africa (EAS = 0.918, SAS = 0.831, EUR = 0.793, and AFR = 0.502), indicating that the ancestral allele frequencies may have increased after Out‐of‐Africa. We mainly focused on SNPs whose derived allele frequencies rapidly increased because it is more difficult to detect signals of recent positive selection targeting the ancestral allele than those targeting the derived allele. However, we conducted a preliminary iHS analysis, which examines the ratio of EHH decay between the ancestral and derived alleles, for rs6414624 in EVC. The iHS value around rs6414624 was negative (iHS = −4.22, p = 2.4 × 10⁻⁵), suggesting a long‐range haplotype carrying the derived allele. In addition, EHH values for ancestral and derived core alleles were not largely different within the four ethnic groups (Figure S6). Therefore, recent positive selection is at least unlikely to have targeted the ancestral allele in EVC, although further analyses such as genome scans for detecting signals of positive selection on standing genetic variation are required.
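For reference, the sign convention behind the negative iHS value can be written out following the definition of Voight et al. (2006). The notation is introduced here for illustration: iHH_A and iHH_D denote the EHH curves of the ancestral and derived core alleles integrated over distance, and the expectation and standard deviation are taken over SNPs whose derived allele frequency p matches that of the core SNP.

```latex
\mathrm{iHH}_X = \int \mathrm{EHH}_X(x)\,dx \quad (X \in \{A, D\}),
\qquad
\mathrm{iHS} =
\frac{\ln\!\left(\dfrac{\mathrm{iHH}_A}{\mathrm{iHH}_D}\right)
      - \mathrm{E}_p\!\left[\ln\!\left(\dfrac{\mathrm{iHH}_A}{\mathrm{iHH}_D}\right)\right]}
     {\mathrm{SD}_p\!\left[\ln\!\left(\dfrac{\mathrm{iHH}_A}{\mathrm{iHH}_D}\right)\right]}
```

Under this convention, a strongly negative iHS means that iHH_D exceeds iHH_A, that is, haplotypes carrying the derived allele are unusually long relative to those carrying the ancestral allele.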
3.2 Susceptibility loci for obesity
The allele “G” at rs7656604, located between the solute carrier family 4 member 4 (SLC4A4) and vitamin D binding protein (GC) genes, was the major allele in the Inabe cohort (0.916) but had a remarkably low frequency in AFRs (0.200; Table 2), suggesting recent positive selection at this locus after Out‐of‐Africa. To estimate positive selection, we conducted sliding window and LD analyses using SNP data in a ~100 kb genomic region at 4q13.3 from the 1KGP. The sliding window analysis indicated that π values among phased haplotypes in a 2 kb region (four windows) around rs7656604 were significantly (p = .0005, Welch's t test) lower in non‐AFRs (π = 0.001–0.012) than in AFRs (π = 0.033–0.046; Figure 3). This suggests that the observed nucleotide diversities were greatly reduced by positive selection (selective sweep) in non‐AFRs. However, LDs among SNPs in the genomic region appeared to be disrupted in EAS (Figure S7). Additionally, the EHH test did not show a wide range of high EHH values (EHH >0.9 across ~7.5 kb in EAS at maximum) for the major (derived) allele “G” (Figure S8). We further conducted Tajima's D tests for the 2 kb region near rs7656604 to examine the significance of deviation from neutral expectations. The results indicate significant (p < .05) negative D values in all four ethnic groups (AFR = −1.91, EAS = −2.08, EUR = −1.76, SAS = −2.03).
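The comparison of window‐wise π values between groups relies on Welch's t test, which does not assume equal variances. A minimal sketch of the test statistic and the Welch–Satterthwaite degrees of freedom is given below. The function name `welch_t` and the input values are hypothetical toy numbers, not the π values of this study, and the p value (which requires the t distribution) is deliberately left out.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Return (t, df) for Welch's unequal-variance t test on two samples."""
    n1, n2 = len(x), len(y)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two observations")
    v1 = variance(x) / n1                     # squared standard error of mean(x)
    v2 = variance(y) / n2                     # squared standard error of mean(y)
    if v1 + v2 == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(x) - mean(y)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Toy window values (illustrative only, not study data)
t_stat, dof = welch_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(t_stat, dof)
```

The resulting t statistic would then be referred to a t distribution with the estimated degrees of freedom to obtain a p value.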

As mentioned above, rs7656604 is located near GC, and, in a previous study (Mozzi et al., 2014), a signal of recent positive selection was detected in this gene within JPT/CHB (Han Chinese in Beijing, China), CEU (Utah Residents with Northern and Western European Ancestry from the CEPH collection), and YRI (Yoruba in Ibadan, Nigeria). According to LDpair, an LDlink web‐based application, rs7656604 and five positively selected SNPs in JPT/CHB (rs2298850, rs11723621, rs62302167, rs1155563, and rs61743452) identified by the previous study were not in LD (r² = 0.014–0.036), suggesting that the allele frequency of rs7656604 was not affected by the selected SNPs.
According to information available on the UCSC Genome Browser (Table S5) and GAGP databases, the minor allele “A” of rs7656604 in non‐AFRs appears to be evolutionarily conserved in primates including archaic humans. Thus, the major “G” allele is likely human‐specific. The Selective Sweep Scan (Neanderthal vs. human polymorphisms) in the UCSC Genome Browser implied that rs7656604 has undergone positive selection in ancestral modern human populations since Out‐of‐Africa.
3.3 Susceptibility loci for dyslipidemia
The LDproxy analyses indicated that rs34902660 of SLC17A3 and rs1042127 of CDSN were in significant LD (r² > 0.8) with 281 and 87 neighboring SNPs, respectively (Tables S6–S7), although the neighboring SNPs are not associated with dyslipidemia‐related phenotypes, according to the GRASP database (Yasukochi et al., 2018c). This suggests that these SNPs can be independently associated with the prevalence of dyslipidemia. Based on allele frequency data from the 1KGP, the minor allele “A” at rs34902660 of SLC17A3 in the Inabe cohort was observed in JPT and AFR only (Table 2). According to information available on the ExAC Browser, the minor allele was present at an extremely low frequency in SAS and EUR (MAF < 0.0007). We conducted the EHH analysis in a 100‐kb genomic region containing SLC17A3 using genotype data from the 1KGP. The analyses displayed high EHH values (>0.8) of the minor (derived) allele “A” across 90 kb and 61 kb genomic regions in JPT and YRI, respectively (Figure 4). As the derived allele was not observed in East Asia outside Japan, no high EHH values of the derived allele were observed in CHB based on SNP data from the 1KGP. The extent of high EHH values of the major (ancestral) allele “C” (<2 kb) was remarkably small in JPT and YRI compared with the EHH values of the derived allele (Figure 4). However, Tajima's D test for the 20 kb genomic region containing rs34902660 in JPT did not show a significant D value (D = −1.25, p > .10).

According to information available on the GAGP database, no derived “A” allele was observed at the homologous target SNP in 38 chimpanzees. In addition, Multiz Alignments of 100 Vertebrates in the UCSC database indicate that all primates including archaic humans possess the ancestral allele “C” (Table S8). These results suggest that the “C” allele may have been conserved in primates, although further research for genetic polymorphisms in each primate species is required.
3.4 Common susceptibility loci for several complex disorders
Six SNPs (rs12229654 at 12q24.11, rs3782886 of BRAP, rs11066015 of ACAD10, rs671 of ALDH2, and rs2074356 and rs11066280 of HECTD4) associated with several complex disorders in the Inabe cohort (Yasukochi et al., 2017, 2018a, 2018c) are closely located in the chromosomal region 12q24.11–q24.13. The LDproxy analyses indicated that rs671 in ALDH2 and five SNPs, including three disease‐associated SNPs detected by our longitudinal exome‐wide association studies, were in significant LD (r² > 0.8) in JPT from the 1KGP (Table S9). Recent positive selection is likely to have acted on the genomic region containing rs671, and the minor alleles have expanded throughout East Asia in recent evolutionary time (Koganebuchi et al., 2017; Luo et al., 2009; Oota et al., 2004; Yasukochi et al., 2017).
In the present study, the sliding window analyses using SNP data in a ~2.5 Mb genomic region at 12q24.11–q24.13 from the 1KGP indicated that π values among phased haplotypes in a ~40 kb region (eight windows) around rs671 did not differ significantly (p = .299, Welch's t test) between EAS (mean π = 0.028 ± 0.006) and AFR (mean π = 0.026 ± 0.005; Figure S9). However, Tajima's D test for the 40 kb region indicated a significant negative D value in EAS (D = −2.79, p < .001) after possible recombination sites were removed. We also conducted the EHH analysis for core SNP rs671 to show a signal of positive selection using genotype data across a ~2.5 Mb genomic region at 12q24.11–q24.13 in EAS and SAS from the 1KGP (Figure S10). In EAS, EHH values of the rs671‐derived allele “A” were considerably higher than those of the ancestral allele “G” across the genomic region examined. In contrast to EAS, no high EHH value of the derived allele was observed in SAS. We thus confirmed the presence of positive selection operating on the genomic region around ALDH2 in EAS.
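To show how a negative Tajima's D arises, the following minimal sketch computes the statistic from a matrix of phased haplotypes using the standard formulas of Tajima (1989). It assumes biallelic sites coded 0/1 and at least four chromosomes. The function name `tajimas_d` and the toy matrices are hypothetical; the sketch does not remove recombination sites or compute p values, so it is not the procedure used in the present study.

```python
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D from phased haplotypes (rows = chromosomes, columns = 0/1 sites)."""
    n = len(haplotypes)
    counts = [sum(site) for site in zip(*haplotypes)]   # allele-1 count per site
    seg = [c for c in counts if 0 < c < n]              # segregating sites only
    s = len(seg)
    if n < 4 or s == 0:
        return float("nan")  # the variance term vanishes for n < 4; D needs S > 0
    # Mean number of pairwise differences (k), summed over sites.
    k = sum(2 * c * (n - c) / (n * (n - 1)) for c in seg)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (k - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy case 1: every variant is a singleton (excess of rare alleles).
singletons = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
# Toy case 2: every variant is at intermediate frequency (2 of 4 chromosomes).
intermediate = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
d_singletons = tajimas_d(singletons)
d_intermediate = tajimas_d(intermediate)
```

An excess of rare variants, as expected after a selective sweep or population expansion, gives a negative D, whereas an excess of intermediate-frequency variants gives a positive D. Because demographic events produce the same skew, the sign alone cannot distinguish selection from demography.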
4 | DISCUSSION
4.1 | Susceptibility loci for T2DM
Individuals with East Asian ancestry are more susceptible to T2DM than those with European ancestry despite a similar BMI (Ma & Chan, 2013). It has been reported that the prevalence of T2DM in Japanese Americans who have western dietary habits was higher than that in European Americans because of reduced insulin secretory capacity in Japanese or East Asian populations (Yabe, Seino, Fukushima, & Seino, 2015). Differences in genetic background and dietary habits may underlie the different insulin secretory capacity between individuals with East Asian and European ancestries.
GGA3 is involved in the intracellular trafficking of insulin‐like growth factor 2 receptor (IGF2R) protein between the trans‐Golgi network and early endosomes (Scott, Fei, Thomas, Medigeshi, & Thomas, 2006). This protein also interacts with ADP ribosylation factor 6 (ARF6; Parachoniak, Luo, Abella, Keen, & Park, 2011), which regulates insulin secretion in the pancreatic β cell (Jayaram, Syed, Kyathanahalli, Rhodes, & Kowluru, 2011; Lawrence & Birnbaum, 2003). Given that GGA3 is related to the insulin secretion pathway through IGF2R or ARF6, the EAS‐specific nonsynonymous substitution at rs78338345 might be involved in the insulin secretory capacity. According to the PhyloP score, the ancestral allele “C” at rs78338345 appears to have been evolutionarily conserved in vertebrates (score = 6.08), consistent with the result of Multiz Alignments of 100 Vertebrates. PolyPhen prediction suggested a deleterious effect on protein function (PolyPhen score = 1.00), although two other protein function prediction tools suggested that the amino acid replacement of rs78338345 (p.E147Q) in GGA3 might not disrupt protein function (SIFT score = 0.134 and CADD scaled C‐score = 24.8). However, the derived allele “G” was identified as a protective allele against T2DM (Yasukochi et al., 2019). Functional analyses are required to elucidate relationships between the candidate SNP in GGA3 and insulin secretory ability.
rs78338345 of GGA3 was identified as a novel genetic determinant of T2DM in our previous study (Yasukochi et al., 2019), and the frequency of the derived allele “G” was higher in EAS than in other ethnic groups. In the present study, our estimates suggest that this allele may have expanded throughout East Asia. As the GGA3 protein may be involved in immune responses mediated by target proteins such as clathrin (Benedicto et al., 2015) and ARF6 (Wu & Kuo, 2012), the derived allele of rs78338345 might have spread across EAS to cope with infectious diseases, although the biological significance remains unclear.
Because Tajima's D tests suggested that allele frequencies of SNPs around GGA3 are skewed in the populations examined, we cannot exclude the possibility that demographic events affected the allele frequencies within EAS. In addition, a core allele that arose from a recent mutation can show low haplotype diversity because few recombination events have occurred since its origin, resulting in high EHH values even without positive selection. According to the ExAC Browser, the derived allele “G” of rs78338345 is also observed outside of East Asia, although the frequencies are very low. One may hypothesize that the derived allele was maintained at an extremely low frequency during the course of modern human evolution and that its frequency increased greatly in East Asia in very recent evolutionary time. It will be intriguing to elucidate the evolutionary forces shaping the genetic variation in the genomic region around GGA3. Further analyses based on population genomics are required to verify the evolutionary history of GGA3.
4.2 | Susceptibility loci for obesity
The frequency of the derived allele “G” (major allele in non‐AFRs) at rs7656604 near GC appears to have increased considerably outside of Africa. Our previous study indicated that this allele is protective against increased BMI (Yasukochi et al., 2018d). The sliding window analysis indicated a pronounced decrease in nucleotide diversity across a ~2 kb region around rs7656604 in non‐AFRs, implying recent positive selection operating on this locus after Out‐of‐Africa. A signal of recent positive selection on five SNPs in GC of EAS was also detected in a previous study (Mozzi et al., 2014). It is conceivable that positive selection has operated independently on rs7656604 and on the positively selected sites in GC, because no pair of SNPs drawn from the two sets was in significant LD (r2 < 0.04). If positive selection has actually operated on rs7656604, one may hypothesize that the frequencies of several haplotypes of the derived allele “G” lineage in non‐AFRs increased soon after Out‐of‐Africa. This may have been caused by a selective sweep on standing genetic variation. Further population genomic analyses are required to accurately elucidate the evolutionary history of the genomic region around rs7656604.
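The sliding window scan of nucleotide diversity mentioned above can be sketched as follows. The sketch assumes biallelic SNPs described by their positions and allele counts in a sample of n phased chromosomes, and reports π per base pair for each window. The function name `sliding_window_pi`, the window and step sizes, and the toy coordinates are all hypothetical and do not reproduce the settings of the present study.

```python
def sliding_window_pi(positions, allele_counts, n, start, end, window, step):
    """Nucleotide diversity (pi per bp) in sliding windows.

    positions: SNP coordinates; allele_counts: count of one allele at each SNP
    among n phased chromosomes. Windows are half-open: [left, left + window).
    """
    results = []
    left = start
    while left + window <= end:
        # Sum the expected pairwise differences over SNPs inside the window.
        k = sum(
            2 * c * (n - c) / (n * (n - 1))
            for pos, c in zip(positions, allele_counts)
            if left <= pos < left + window
        )
        results.append((left, k / window))
        left += step
    return results


# Toy region of 2,000 bp sampled in 4 chromosomes, scanned with 1,000-bp
# windows moved in 500-bp steps.
toy_positions = [100, 600, 1200]
toy_counts = [2, 1, 2]
pi_windows = sliding_window_pi(toy_positions, toy_counts, n=4,
                               start=0, end=2000, window=1000, step=500)
lowest_window = min(pi_windows, key=lambda w: w[1])
```

A local trough in such a profile, like the window starting at position 1,000 in this toy example, is the kind of signal that is interpreted as reduced diversity around a candidate site. As noted for the other statistics, demographic history can also produce such troughs.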
The GC protein is a main carrier for vitamin D and its metabolites. Vitamin D deficiency may be related to the development of adiposity or increased BMI (Snijder et al., 2005). rs7656604 near GC might thus be related to increased BMI through an effect on vitamin D synthesis. Cutaneous vitamin D production is affected by exposure to ultraviolet (UV) radiation and skin pigments (Jablonski & Chaplin, 2017), and lighter skin color in non‐AFRs may have evolved to optimize vitamin D production for shorter exposure to daily sunlight (Wilde et al., 2014). Because frequencies of the derived allele at rs7656604 were similar among non‐AFRs, irrespective of geographical distributions based on latitude, the derived allele “G” lineage might have been targeted by natural selection to fine‐tune vitamin D homeostasis under UV and food conditions outside Africa. Alternatively, the derived allele might have expanded throughout ancestral modern humans to acquire resistance to novel pathogens outside Africa, because vitamin D plays an important role in the modulation of the immune system (Calton, Keane, Newsholme, & Soares, 2015; Kamen & Tangpricha, 2010). Further research on the correlation of GC expression profiles with rs7656604 is required to examine whether rs7656604 is involved in the regulation of GC expression levels.
4.3 | Susceptibility loci for dyslipidemia
The rs34902660 “A” allele in SLC17A3 at 6p22.2 was only observed in JPT and AFR from the 1KGP. This suggests that the derived allele “A” may have increased in JPT in very recent evolutionary time. However, the negative Tajima's D value in JPT was not statistically significant. A frequency spectrum neutrality test may fail to detect a weak signal of recent positive selection if the selective sweep in JPT is very recent or still ongoing. According to the ExAC Browser database, the derived allele “A” is observed at an extremely low frequency in SAS and EUR populations. This suggests that the derived allele existed in the common ancestral populations of modern humans. One may hypothesize that the derived allele frequency has recently increased in EAS populations after it was greatly reduced in non‐AFR ancestral populations following Out‐of‐Africa. To validate our results, further scans for the footprint of positive selection are required.
The SLC17A3 protein is involved in efflux transport of intracellular urate and organic anions from the blood into the renal tubule cells. This protein plays a key role in urate secretion in the renal tubular cell (Jutabha et al., 2010). It is possible that an increase in xanthine oxidoreductase activity is related to elevated serum concentration of uric acid, resulting in the development of dyslipidemia (Maiuolo, Oppedisano, Gratteri, Muscoli, & Mollace, 2016). According to the SIFT prediction, rs34902660 of SLC17A3 might cause a dysfunctional transcript [p.G239V (SIFT score = 0.018)]. Therefore, the association of rs34902660 of SLC17A3 with dyslipidemia might be attributable to the effect of this gene on abnormal lipid profiles through impaired urate secretion. The serum concentration of uric acid is associated with the production of pro‐inflammatory cytokines (Crișan et al., 2016; Spaetgens et al., 2017). Given that the accelerated production of cytokines through the increased serum uric acid level is important for host defense against viral, bacterial, and parasite infection, the derived allele of rs34902660 might confer resistance to endemic or epidemic diseases in Japan. However, some prediction tools did not support the result of SIFT prediction (PolyPhen = 0.218, CADD scaled C‐score = 14.8, and PhyloP = 0.06). Further functional analyses are required to clarify the effect of rs34902660 on the protein function.
4.4 | Common susceptibility loci for several complex disorders
Previous studies indicated that derived alleles of several SNPs at 12q24.11–q24.13 have expanded throughout East Asia in recent evolutionary time (Kato et al., 2011; Yasukochi et al., 2017). Our association studies suggested that derived alleles of six EAS‐specific SNPs are protective against hypertension (Yasukochi et al., 2017) and hyperuricemia (Yasukochi et al., 2018a) but increase susceptibility to dyslipidemia (Yasukochi et al., 2018c). This discrepancy might be due to differences in the effects of these SNPs on functional pathways involved in disease‐related phenotypes.
Of the six SNPs, the major genetic factor for the prevalence of complex disorders may be rs671 of ALDH2, because only this SNP results in an amino acid change (at position 504). The amino acid change (p.E504K) is well known to cause defective enzyme activity (Crabb, Edenberg, Bosron, & Li, 1989; Hsu, Bendel, & Yoshida, 1987). In fact, the function prediction tools suggest that the missense variant has an effect on the protein function (PolyPhen score = 0.964, SIFT = 0.019, CADD scaled C‐score = 34.0, and PhyloP = 9.71). The derived allele of rs671 may have arisen ~8,000 years ago when large‐scale rice cultivation started in southern China (Koganebuchi et al., 2017). Previous studies suggested that positive selection has acted on ALDH2 in EAS because of acquired resistance against parasite infection related to large‐scale rice cultivation (Koganebuchi et al., 2017; Luo et al., 2009; Oota et al., 2004). ALDH2 may be involved in the metabolism of 4‐hydroxy‐2‐nonenal, a byproduct of lipid peroxidation (Breitzig, Bhimineni, Lockey, & Kolliputi, 2016). Therefore, it is possible that the decreased ALDH2 enzyme activity is associated with the prevalence of complex disorders through accelerated aging phenotypes because of the increased susceptibility to oxidative stress. Although the functional relevance of rs671 to the development of complex disorders remains unclear, the drastic functional change in the ALDH2 enzyme may have additional effects on functional pathways related to several complex disorders.
4.5 | Study limitations
It is possible that demographic events such as population growth and bottlenecks, as well as variation in recombination rate, yield genetic patterns that are similar to those generated by natural selection (Bamshad & Wooding, 2003; Nielsen, 2005). Low haplotype diversity around a core allele that arose from a recent mutation can also result in high EHH values without positive selection. Therefore, the fluctuation in allele frequencies, LD, or genetic diversities in the genomic regions examined should be carefully interpreted. Further analyses based on population genomics are required to verify the evolutionary history of these disease‐susceptibility loci.
5 | CONCLUSION
Our findings indicate that derived allele frequencies of nine disease‐associated SNPs (rs78338345 of GGA3, rs7656604 at 4q13.3, rs34902660 of SLC17A3, and six SNPs closely located at 12q24.1) in Japanese populations largely increased after Out‐of‐Africa.
ACKNOWLEDGMENTS
This work was supported by Research Grant (OKF‐17‐1‐1) from Okasan Kato Culture Promotion Foundation (to Y. Yasukochi), the Kurata Grant (Grant number, 1323) awarded by the Hitachi Global Foundation (to Y. Yasukochi, Y. Yamada), CREST (JPMRJCR1302) of the Japan Science and Technology Agency (to Y. Yamada, J. Sakuma, I. Takeuchi), and by Japan Society for the Promotion of Science KAKENHI grants JP17H00758 (to I. Takeuchi, Y. Yasukochi) and JP15H04772 (to Y. Yamada).
CONFLICT OF INTEREST
None declared.
AUTHOR CONTRIBUTIONS
Y. Yasukochi contributed to the design of the study, analysis and interpretation of the data, and drafting of the manuscript. J. Sakuma and I. Takeuchi contributed to analysis and interpretation of the data as well as revision of the manuscript. K. Kato, M. Oguri, T. Fujimaki, and H. Horibe each contributed to acquisition of the data and revision of the manuscript. Y. Yamada contributed to acquisition, analysis, and interpretation of the data, and revision of the manuscript.




