Fine‐mapping of HLA class I and class II genes identified two independent novel variants associated with nasopharyngeal carcinoma susceptibility

Abstract Background Several genome‐wide association studies (GWASs) have identified strong associations between genetic variants in the human leukocyte antigen (HLA) region and nasopharyngeal carcinoma (NPC). However, given the complex LD pattern in this region, the causal variants and the underlying mechanism of how genetic variants in HLA contribute to NPC development is yet to be understood. Methods To systematically characterize the HLA variants and their relationship to NPC susceptibility, we fine‐mapped the HLA genes based on the GWAS data of 1583 NPC cases and 972 healthy controls, using SNP2HLA with the Pan‐Asian panel as references. Stepwise conditional regression was used to identify independent association loci. Results Interestingly, the most significant association was the presence of Gln in HLA‐A amino acid position 62 (OR = 0.57, P = 1.41 × 10−16). The G allele of rs2894207 located between HLA‐B and HLA‐C showed protective effect of NPC development (OR = 0.52, P = 2.23 × 10−13). Additionally, amino acid Phe‐67 located in the peptide‐binding pocket of HLA‐DRB1 was identified as a novel functional variant with OR = 0.64 and P = 9.64 × 10−11. Another novel variant, Glu‐45 in HLA‐B pocket B, conferred a protective effect on NPC susceptibility (OR = 0.64, P = 5.23 × 10−8). These four variants explained 2.07% of the phenotypic variance for NPC risk. Conclusion In summary, by fine‐mapping the HLA region in south Chinese population, we reported additional loci missed in the GWAS studies and provided a better understanding of the relationship between HLA and NPC susceptibility.


| INTRODUCTION
Nasopharyngeal carcinoma (NPC) is an epithelial squamous cell carcinoma with a distinct incidence around the world. In endemic regions including southern China and other Southeast Asian countries, the incidence rate reached 25-50 per 100 000 people, which is remarkably higher than that in the rest of the world. [1][2][3][4] In addition to the unbalanced geographic distribution, familial clustering of NPC was often observed, and approximately 10% of NPC patients have familial history, 5 indicating the genetic contributions to NPC susceptibility. Apart from host genetics, environmental exposures and Epstein-Barr virus (EBV) infection also play important roles in NPC development. Tobacco smoking and salted fish consumption are associated with increased NPC risk. [6][7][8] Several long-term prospective studies showed that people who carried elevated VCA-IgA, EA-IgA, or DNase may have a 20-30 fold increase in NPC risk in endemic areas, 9,10 and a recent study showed that EBV plasma DNA was a useful biomarker for early detection of NPC. 11 Among the host genetic markers that have been associated with NPC, strong, and consistent association signals have been shown in the human leukocyte antigen (HLA) region. In the genome-wide association studies (GWASs) across southern China, Taiwan, and Malaysia, a total of 15 SNPs in the HLA region were identified and confirmed as association markers of NPC risk. [12][13][14][15] Proteins encoded by HLA class I and class II genes affect host responses to EBV infection through viral antigen presentation. Variations in amino acid residues at specific positions may influence the antigen-presenting process and facilitate tumor cell evasion of host immune surveillance. 16 Recently, a GWAS study fine-mapped the HLA class I genes and detected several functional amino acids variants in the HLA-A, HLA-B, and HLA-C loci. 14 In addition, another GWAS study in Malaysian Chinese used high-resolution finemapping on HLA-A and identified some variants in the antigen peptide-binding groove and T-cell receptor binding site. 13 In our previous GWAS in southern Chinese, three NPC protective SNPs were detected in this region including one located in 4 kb upstream of HLA-A, one in the intergenic region of HLA-B and HLA-C and one located between the HLA-DRB1 and DQA1 loci. 12 However, interpretations of these results are difficult because in the current opinion, the reported SNPs are proxy for association signals without direct biological function. In addition, the identification of driving variants is problematic owing to the complicated LD pattern of this gene-dense region. To further study these loci, we conducted a fine-mapping study on our preexisting GWAS data using SNP2HLA, 17 an HLA-specific imputation tool to systematically explore the potential causal variants in the whole HLA region that drives the predisposition to nasopharyngeal carcinoma in southern Chinese population.

| Subjects, genotyping, and quality control
Subjects were from a large GWAS study and were fully described elsewhere. 12 In brief, a total of 1615 cases were recruited from Sun Yat-sen University Cancer Center and were pathologically diagnosed according to the WHO classification. During the same period, 1025 healthy controls from 21 municipalities in Guangdong Province were recruited from physical examination centers of hospitals in Guangdong and had no history of malignancy. All cases and controls in this study are self-reported as Guangdong Chinese and lived in Guangdong Province at the time of the study. Age (±5 years), gender, geographic region, and ethnicity are matched by frequency between cases and controls.
Genotyping of the 2640 GWAS samples was performed using Illumina Human610-Quad BeadChips (620 900 SNPs). We followed the previously described quality control procedures 12 and removed SNPs with call rates < 95%, SNPs with minor allele frequency (MAF) <3% and SNPs with deviation from Hardy-Weinberg equilibrium in controls (P < 10 −6 ). In addition, samples with genotype call rates <96%, duplicates or relatives identified by identity-by-descent analysis and population outliers detected by principal component analysis were also excluded. After quality control, a total of 465 618 SNPs in 1583 cases and 972 controls were included in this study.

| Imputation of HLA variants
The study design and analysis workflow are shown in Figure  S1. First, 2602 SNPs within the entire HLA region from 29 to 34 Mb on chromosome 6 (NCBI Build 36) were extracted from the GWAS data described above. Using these data, we imputed two-digit and four-digit classical HLA alleles, Conclusion: In summary, by fine-mapping the HLA region in south Chinese population, we reported additional loci missed in the GWAS studies and provided a better understanding of the relationship between HLA and NPC susceptibility.

K E Y W O R D S
fine-mapping, human leukocyte antigen, nasopharyngeal carcinoma, susceptibility amino acid polymorphisms in HLA-A, HLA-B, HLA-C, and HLA-DRB1, HLA-DQA1, HLA-DQB1, HLA-DPA1, and HLA-DPB1 as well as the un-genotyped SNPs in this region using SNP2HLA software. 17 The Pan-Asian reference panel was used in our analysis, which includes Han Chinese population, Southeast Asian Malaysian population, Tamil Indian and Japanese population. 18,19 After imputation, quality control was conducted again to exclude variants with low imputation quality. Here, we removed variants with INFO<0.5 or MAF<0.01. The INFO score was defined by the ratio of the observed variance in dosage to the expected variance under Hardy-Weinberg equilibrium. Finally, a total of 6124 variants, which consisted of 5144 SNPs, 832 amino acid polymorphisms and 148 two-digit and four-digit classical HLA alleles remained in our further analysis.

| Statistical analysis
All association analyses were conducted by R and plink (v1.07). There were four types of HLA variants in our imputed data set including bi-allelic SNPs, bi-allelic classical HLA alleles, bi-allelic amino acid polymorphisms, and multi-allelic amino acid polymorphisms. The effects of all variants were assumed to be additive on a log (Odds) scale and represented by the allele dosages (posterior genotype probability). For each bi-allelic variant, a traditional logistic regression model adjusted with age and gender was used to test its effect. For each multi-allelic amino acid polymorphism consisting of m alleles, two types of test were conducted. First, the effect of each allele (Present vs Absent) in a specific multi-allelic position was tested using traditional logistic regression as dichotomous variable. Additionally, to investigate the overall effect of the multi-allelic site, an omnibus P value was calculated based on the log-likelihood ratio test by comparing the likelihood of the null model L 0 (2) to the likelihood of the full model L 1 (1). 20 Here, the full model is represented by where β 0 is the regression intercept, β 1,i is the effect size of dosage allele i among the m alleles and β 2 , β 3 are the effect sizes of age and gender, respectively. The null model is represented by and the test statistic with m-1 degree of freedom was written as After imputation and post-imputation quality control, 6124 variants were considered in our study. We set the significance threshold of the association study by applying Bonferroni correction (8.16 × 10 −6 , 0.05/6124).
Stepwise conditional regression analysis was performed to explore the potential independent driving variants of NPC. In each step, the most significant variant was added as a covariate in the model until no remaining variants can reach the significant threshold (P = 8.16 × 10 −6 ). Then, a permutation test was also used to test the identified variants and confirm the result of our discovery. The binary phenotypes were resampled 10 000 times to generate permutation test statistics.
Permutation P values were calculated by comparing the observed test statistic and the permutation test statistic. 21 We tested the cumulative or additive effect of the candidate variants by counting the number of protective alleles at each locus and used the Cochran-Armitage trend test to detect the trend effect. 22,23 The cumulative ORs for subjects carrying different copies of protective alleles were estimated by comparing them with those carrying none of these protective alleles. LD statistics (D′ and r 2 ) were calculated by Haploview version 4.2 to explore the relationship between variants. 24 Haplotype estimation was performed by the R package Haplostats. 25

| Variance explained by identified variants
The variance of phenotype explained by specific groups of genetic loci was estimated using a fixed effects model as described previously. 20,26,27 The four identified variants in our study, seven SNPs reported in our previous GWAS study on NPC were used to calculate the respective variances. 12 We used the NPC five-year prevalence 28 to transform the estimated heritability from an observed scale to a liability scale.

| Association analysis for single variants in HLA region
After the association analysis was performed for each variant in a logistic regression model, a total of 357 variants showed significant associations with NPC, including 270 SNPs, 17 bi-allelic and 59 multi-allelic amino acid polymorphisms and 11 two-digit and four-digit HLA classical alleles (Table S1).
In the previous GWAS study on NPC, a total of 15 SNPs in the HLA region were reported. [12][13][14][15] In this study, we replicated three SNPs in GABBR1, two SNPs in HLA-A, two SNPs in HCG and one SNP in HLA-B/C with association significance at GWAS level (P < 5 × 10 −8 ). In addition, two risk alleles in the HLA-F locus (rs3129055 and rs9258122) detected by Tse et al 15 also showed significance with P values at the level of 10 −5 . Four SNPs in HLA-A were not imputed because none of them were included in the Pan-Asian reference panel (Table S2).
Among the 148 2-digit and 4-digit HLA classical alleles included in our analysis, 11 of them (located in HLA-A, B, DRB1, and DQB1 loci) were significantly associated with NPC. Our result further confirmed the previous discovery of association between classical alleles in class I and class II HLA genes and NPC susceptibility (Table S3). To further explore the combination effect of the significant alleles, we estimated their haplotypes using the significantly associated HLA alleles above and found a total of four haplotypes constructed by class I and class II genes that were associated with NPC susceptibility. Two NPC protective haplotypes, HLA  (Table S4). We detected 42 amino acid polymorphisms in class I genes (HLA-A and HLA-B) and 12 in class II genes (HLA-DQB1 and HLA-DRB1) as significant association signals in our study (Table S5). Association signals were led by a previously identified amino acid HLA-A Gln-62 (OR = 0.54, P = 2.81 × 10 −22 ). Previous studies have shown that HLA-A Gln-62 marked the HLA-A*11:01 allele together with several HLA-A amino acids nearby in the same LD block. 12,13 In addition, some amino acids in HLA-B also showed significant with NPC susceptibility, led by the amino acid Leu at position −16 and 116 in high LD with each other (D′ = 0.928, r 2 = 0.58). Although HLA-C Trp-156 was detected as an association signal in Tang's research, 14 it did not reach the significance level after Bonferroni correction (OR = 0.63, P = 6.59 × 10 −5 ). For the amino acids in HLA class II genes, we discovered twelve amino acid polymorphisms in HLA-DRB1 and HLA-DQB1 associated with NPC susceptibility (Table S5), led by HLA-DQB1 Glu-45 (OR = 0.62, P = 6.58 × 10 −11 ). LD analysis showed that polymorphism of this amino acid was in high LD not only with other significant amino acid variants in this region ( Figure S2

| Stepwise conditional analysis detected four potential driving variants
Since the HLA region has a complex LD pattern, we identified the independent variants that may drive NPC susceptibility using stepwise conditional analysis. In each step, the most significant variant was included in the next model as a covariate until no variant reached the significance level (P < 8.16 × 10 −6 ). When conditioned on age and gender, the most significant association signal was the HLA-A amino acid Gln variant at position 62 (OR = 0.54, P = 2.81 × 10 −22 ; Figure 1A, Table S6). After adjustment for age, gender, and HLA-A Gln-62, rs2894207, a SNP reported in our previous GWAS study, showed the leading significant signal among the remaining variants (OR = 0.58, P = 1.73 × 10 −10 ) ( Figure 1B, Table S6). When age, gender, HLA-A Gln-62, and rs2894207 were used as covariates, we observed a third independent variant, HLA-DRB1 Phe-67 (OR = 0.67, P = 3.3 × 10 −9 ) ( Figure 1C, Table  S6). After adjustment for age, gender, HLA-A Gln-62, rs2894207, and HLA-DRB1 Phe-67, we identified HLA-B Glu-45 as the fourth most significant variant (OR = 0.64, P = 5.23 × 10 −8 ) ( Figure 1D, Table S6). When conditioning on age, gender, HLA-A Gln-62, rs2894207, HLA-DRB1 Phe-67, and HLA-B Glu-45, no remaining variants in the HLA region reached the Bonferroni threshold ( Figure 1E, Table S6). An LD plot confirmed that these four variants were statistically independent of each other. All of the four variants were further confirmed by permutation testing (Table S6 and S7).
After conducting stepwise conditional analysis, we built a multivariate NPC susceptibility model which incorporated four independent protective HLA variants. All variants in the model showed protective effect (Tables 1 and S7)

| LD analysis of four driving variants and the previously reported variants
Several SNPs and functional amino acid polymorphisms in the HLA region have been detected as susceptibility loci for NPC. 13,14 To explore their relationship with our four identified variants, a systematic LD analysis was performed (Table S8, Figure 2). Strong LD was shown in HLA-A Gln-62 with SNPs in HLA-A and its adjacent SNPs, including rs2860580 (D′ = 0.985, r 2 = 0.941), which was reported in our previous GWAS study. The identified intergenic SNP rs2894207 located between HLA-B and HLA-C was in medium LD with amino acid HLA-B Leu-116 (D′ = 0.545, r 2 = 0.286) and HLA-C Trp-156 (D′ = 0.468, r 2 = 0.103) but in low LD with HLA-B Glu-45 (D′ = 0.158, r 2 = 0.001). The amino acid in class II gene HLA-DRB1 Phe-67 was associated with rs28421666 (D′ = 0.977, r 2 = 0.247) and when conditioning on HLA-DRB1 Phe-67, the association P value of rs28421666 increased dramatically (OR = 0.76, P = 0.00716). Low LD was observed for HLA-B Glu-45 with any other reported variants.

| Cumulative effect of four variants
Taking into account the four potential driving variants with protective effects on NPC, we moved on to explore their cumulative effect. Analysis of a combination of the four variants provided a stronger cumulative association with NPC (P trend = 1.03 × 10 −35 ) than any individual variant (Tables 2  and S7). Compared with subjects without any of the protective variants, individuals with one to four protective variants showed lower NPC risk with ORs of 0.63, 0.33, 0.19, and 0.11.

| Variance explained by four variants
Based on the four identified variants in our study and those detected SNPs located in HLA and non-HLA regions in our previous study, we estimated the genetic variance explained (h 2 ) with a liability threshold model, assuming an NPC prevalence of 0.000078. 28 The four identified variants in our study explained h 2 = 2.07% on a liability scale, whereas the associated HLA and non-HLA SNPs in our previous GWAS study explained 1.73% and 0.40%, respectively ( Table 3). The four variants in this study can explain 72.89% of the phenotypic variance owing to genetic variants.

| DISCUSSION
Human leukocyte antigens (HLA) have been proposed as important host genetic factors associated with NPC, given their central role in viral antigen presentation to the immune system. 29 An association between HLA genes and NPC was first proposed by Simons and colleagues. 30 Since then, the association between HLA and NPC has been further confirmed by over one hundred candidate association studies. In recent F I G U R E 1 Stepwise conditional regression result of HLA loci independently associated with NPC susceptibility. Imputed allelic dosage (between 0 and 2) of HLA variants including SNPs, amino acid polymorphisms, and classic alleles were tested. In each panel, each point represented -log 10 (P value) of the variants, and the dashed line marked the Bonferroni P value (P < 8.16 × 10 −6 ). A, The strongest association was amino acid Gln-62 in the HLA-A locus (OR = 0.54, P = 2.81 × 10 −22 ). B, After adjustment for age, gender, and HLA-A Gln-62, the most significant signal was SNP rs2894207 (OR = 0.58, P = 1.73 × 10 −10 ). C, When age, gender, HLA-A Gln-62, and rs2894207 were used as covariates, we observed HLA-DRB1 Phe-67 to be the most significant variant (OR = 0.67, P = 3.3 × 10 −9 ). D, After adjustment for age, gender, HLA-A Gln-62, rs2894207, and HLA-DRB1 Phe-67, we detected HLA-B Glu-45 as the fourth most significant variant (OR = 0.64, P = 5.23 × 10 −8 ). E, When conditioning on age, gender, HLA-A Gln-62, rs2894207, HLA-DRB1 Phe-67, and HLA-B Glu-45, no remaining variants in the HLA region reached the Bonferroni threshold (Table S6) years, four independent GWASs have consistently reported SNPs in the HLA region to have the strongest association signals with NPC. Given the crucial role of HLA genes, finemapping of the HLA locus is beneficial to further explore the functional variants involved in NPC development. In this study, we imputed the HLA region by an HLA-specific imputation tool SNP2HLA, using the Pan-Asian panel as references and detected four potential driving variants that were independently associated with NPC susceptibility by stepwise conditional regression analysis. We identified two novel amino acid polymorphisms, HLA-DRB1 amino acid Phe in site 67 and HLA-B amino acid Glu in site 45 associated with NPC risks. These two loci, combined with the other two known loci, HLA-A amino acid Gln in site 62 and rs2894207 (located in intergenic region of HLA-B and HLA-C), made a cumulative contribution to NPC risk and explained 2.07% of NPC susceptibility.
The imputation method for fine-mapping has been successfully applied to other immune-related disease such as celiac disease, Graves' disease and type 1 diabetes. 20,26,27 Although we did not genotype the HLA region directly, we fine-mapped this region and repeated associations of some important classic alleles associated with NPC, such as protective alleles HLA-A*11:01, HLA-B*13:01 and risk alleles HLA-A*2, HLA-B*46:01. Additionally, we also replicated some important amino acid polymorphisms in HLA class I genes. The NPC association signals were driven by the presence of glutamine at HLA-A position 62. This variant has been identified in two NPC GWASs 13,14 and is now repeated in our imputation data. It marked the NPC protective classic allele HLA-A*11:01 and was highly correlated with rs2860580, the association signal observed in our previous GWAS study. HLA-A amino acid site 62 is located in exon 2, which encode the alpha 1 domain. Polymorphisms within this region as well as in exon 3 are responsible for the peptide-binding specificity of each class one molecule.
HLA-B Glu-45, located in pocket B of the HLA-B peptide-binding groove, is one of the functional variants that was newly identified in this study. In a previous fine-mapping study of NPC, another amino acid Leu-116 in HLA-B pocket F, was identified as a protective allele in NPC. 14 Although they were close in physical position, a low correlation was shown in the LD analysis. HLA-B plays an important part in the body's immune response to viral attack. A recent GWAS study of HIV immune control in different population demonstrated that specific amino acids in the HLA-B peptide-binding groove played roles in modulating durable control of HIV infection. 31 The signals of NPC association in HLA class II genes were represented by the presence of phenylalanine at HLA-DRB1 position 67, which is a newly identified functional variant associated with NPC in our study. This variant is located in exon 2 of the HLA-DRB1 locus, which plays a central role in the immune system by presenting peptides derived from extracellular proteins. A previous study has identified an amino acid signature at positions 26, 67, 71, and 74 in the peptide-binding pocket of DR β-chain to be strongly associated with type 1 diabetes (T1D) and autoimmune thyroid disease (AITD). 32 In our previous GWAS study, rs28421666 in the DR/DQ region was identified as one of the NPC association loci. When conditioning on HLA-DRB1 Phe-67, the association P value of rs28421666 increased dramatically, suggesting that the previously identified signal in our GWAS study may be driven by this functional variant. Analysis also showed that Phe-67 in DRB1 was in high LD with classic alleles HLA-DRB1*11:01, HLA-DQB1*03, and HLA-DQB1*03:01. While associations were detected for all of them in our study, the systematic study of HLA class I and class II alleles in Taiwanese did not observe their associations with NPC. 33 Therefore, Phe-67 in DRB1 may be a new functional variant playing an important role in NPC development.
In summary, we fine-mapped the HLA region, further exploring the GWAS data of NPC by imputation, and identified four protective variants that may influence NPC susceptibility independently. Our study is beneficial for a deeper understanding of the relationship between host genetic factors and NPC predisposition. Apart from previously identified variants, we reported additional loci that were missed in the GWAS studies and fine-mapping studies, which may further explain the complex HLA association with NPC in the southern Chinese population. Further studies on the functional basis are warranted to uncover the underlying mechanism of how HLA variations contribute to NPC predisposition. Genetic variants used for estimating heritability, including four variants identified in our stepwise conditional regression, three HLA SNPs and four non-HLA SNPs detected in our previous GWAS study. b Five-year prevalence of NPC (0.000078) was used for the heritability estimate on the liability scale. 28