Potential conflict of interest: Nothing to report.
Primary liver cancer is the third most common cause of cancer-related death worldwide, with a rising incidence in Western countries. Little is known about the genetic etiology of this disease. To identify genetic factors associated with hepatocellular carcinoma (HCC) and liver cirrhosis (LC), we conducted a comprehensive, genome-wide variation analysis in a population of unrelated Asian individuals. Copy number variation (CNV) and single nucleotide polymorphisms (SNPs) were assayed in peripheral blood with the high-density Affymetrix SNP6.0 microarray platform. We used a two-stage discovery and replication design to control for overfitting and to validate observed results. We identified a strong association with CNV at the T-cell receptor gamma and alpha loci (P < 1 × 10−15) in HCC cases when contrasted with controls. This variation appears to be somatic in origin, reflecting differences in T-cell receptor processing between lymphocytes from individuals with liver disease and those from healthy individuals that are not attributable to chronic hepatitis virus infection. Analysis of constitutional variation identified three susceptibility loci including the class II MHC complex, whose protein products present antigen to T-cell receptors and mediate immune surveillance. Statistical analysis of biologic networks identified variation in the “antigen presentation and processing” pathway as being highly significantly associated with HCC (P = 1 × 10−11). SNP analysis identified two variants whose allele frequencies differ significantly between HCC and LC. One of these (P = 1.74 × 10−12) lies in the PTEN homolog TPTE2. Conclusion: Combined analysis of CNV, individual SNPs, and pathways suggests that HCC susceptibility is mediated by germline factors affecting the immune response and differences in T-cell receptor processing. (HEPATOLOGY 2010)
Primary liver cancer is the third most common worldwide cause of cancer-related deaths, with a rising incidence in Western countries. The highest incidence in the world occurs in Korea, where the rate among males is 44.9/100,000.1, 2 Hepatocellular carcinoma (HCC) is responsible for 85%-90% of primary liver cancers, with a high incidence rate (35-50/100,000 in males) in Asian countries like China and South Korea. HCC is associated with several major risk factors including chronic hepatitis B and C infection, consumption of aflatoxin-contaminated foods, excessive consumption of alcohol, and liver cirrhosis (LC).3-5 Both the variability in outcome following the same environmental exposure and the clustering of HCC within families suggest genetic susceptibility.6-8
Genetic analysis of HCC susceptibility, to date, has centered on examination of individual candidate genes whose variation may plausibly influence the response to known environmental risk factors.6, 9, 10 Recent technological advances have made it feasible to perform comprehensive, genome-wide searches for genetic factors associated with disease susceptibility and progression. These factors include both single nucleotide and copy number polymorphisms. To date, genome-wide analysis of liver cancer has been limited to the examination of HCC tumor tissue and adjacent uninvolved liver tissue, an approach that identifies somatic changes associated with the tumor.11 Moreover, these studies have largely focused on changes in gene expression measured at the RNA level. To identify susceptibility loci for liver disease, we conducted an association study analyzing single nucleotide polymorphisms (SNPs) and copy number variations (CNVs) in DNA isolated from peripheral blood; for this work we used the Affymetrix SNP 6.0 microarray, which contains 934,968 SNPs and 945,826 structural variation markers.
Our genome-wide association study (GWAS), the first to focus on HCC, revealed that both constitutional genetic variations and somatic genomic events are risk factors for HCC. We observed an association between germline variants in the MHC class II loci and somatic CNV at T-cell receptor loci and liver disease. Our findings provide genomic evidence that genes involved in the immune response play a critical role in the development of liver cancer.
Abbreviations: CNV, copy number variation; GWAS, genome-wide association study; HBV, hepatitis B virus; HCC, hepatocellular carcinoma; HCV, hepatitis C virus; LC, liver cirrhosis; SNP, single nucleotide polymorphism; TCR, T-cell receptor; TRA@, T-cell receptor alpha complex; TRG@, T-cell receptor gamma complex.
Materials and Methods
Study Population and Design.
This study involved unrelated HCC and LC patients of Korean ethnicity treated at the Asan Medical Center, Seoul, Korea. Disease diagnosis was confirmed by histopathology. Previous clinical history, enzyme-linked immunosorbent assay (ELISA)-based serum test results for hepatitis B virus (HBV) and hepatitis C virus (HCV), and clinical laboratory data were collected for these individuals; 89% of the HCC cases and 76% of the LC cases were chronically infected with either HBV or HCV.
Two sources of controls were used. The first set consisted of unrelated individuals from the Asan Medical Center; the viral infection status of these controls was not ascertained. The second set consisted of HBV+ individuals of Chinese origin (described previously).6 The study was approved by the local ethics committees, and all subjects gave informed consent before inclusion in the study.
A total of 386 Korean HCC cases, 86 Korean LC cases, 587 Korean controls (Supporting Table S1), and 100 Chinese controls passed the quality control evaluations (DNA integrity measurement, STRP genotyping for assessment of identity, and high SNP call rate from the Affymetrix 6.0 platform) described below. We confirmed through molecular assays that there is no population stratification among the Korean samples (see Supporting Methods). Individuals from the Korean population set were assigned to the discovery (Stage 1) or validation (Stage 2) group based on their order of enrollment in the study. Stage 1 included 271 controls, 180 HCC cases, and 66 LC cases; Stage 2 had 316 controls, 206 HCC patients, and 20 individuals with LC. Key findings from the two-stage analysis were further validated using the Chinese control samples.
DNA and RNA Preparation.
Peripheral blood DNA was extracted using the Blood DNA Kit (Qiagen, Valencia, CA). DNA integrity and quantity were assessed using the Quantifiler Human DNA Quantification kit (Applied Biosystems, Foster City, CA). Polymerase chain reaction (PCR) products were analyzed with an ABI 3130XL Automated DNA Sequencer and the GeneMapper ID v3.2 software (Applied Biosystems, Foster City, CA).
Affymetrix SNP6.0 Assay.
The Affymetrix SNP6.0 assay was performed according to the manufacturer's instructions (Affymetrix, Santa Clara, CA). Assay runs were performed in 96-well plates containing equal numbers of case and control samples, two Asian HapMap samples (chosen from NA18954, NA18971, NA18603, and NA18995) for external genotype validation, the Affymetrix Affy103 control DNA, and a water blank. Cases were randomly selected for each plate one-by-one using a random-number generator. Each case in the discovery phase was paired with its best match in sex and age among the control samples. Processing each Stage 1 case along with a matched control was aimed at minimizing technical variation in experimental results. Controls in the validation phase have limited clinical information and therefore were selected randomly. Each batch of 47-93 SNP6.0 assays was analyzed with the Affymetrix Genotyping Console v. 3.0 birdseed program. Samples with a global allele call rate below 98.5% were excluded from further analysis. In all, 90.5% of samples had an SNP call rate ≥99%. Genotype and CNV data are deposited in caArray (https://array.nci.nih.gov/caarray/project/bueto-00429).
Copy Number Analysis.
Given the large number of markers examined in a GWAS, it is critical to control for false discovery by validating observations in an independent population. We employed a two-stage discovery-replication study design for our comparison of HCC patients and healthy controls (Supporting Fig. S1). The study population was divided into independent discovery (Stage 1) and validation (Stage 2) sets as described above. Stage 1 and Stage 2 samples were analyzed separately for CNV using the Affymetrix Genotyping Console program with default parameters and the HapMap270 reference model. The resulting copy number log2ratio data served as input for the R DNAcopy package, which implements the circular binary segmentation (CBS) algorithm.12 We converted CBS copy number values to discrete copy number states (high, normal, low) using thresholds two standard deviations from the mean CNV of all autosomal markers in the dataset (described in Supporting Methods). In all, 422,062 nonoverlapping genomic segments were identified in the analysis of the Stage 1 samples. CNV segments associated with HCC were identified using a 2×3 Fisher's exact test. The 2,318 segments with P below 1 × 10−4 in the Stage 1 samples were retested in the Stage 2 samples. For validation, segments had to show an association with disease in the Stage 2 population with a P < 2.157 × 10−5, corresponding to P ≤ 0.05 after Bonferroni adjustment for 2,318 tests. We confirmed that age and gender were not confounding variables in our analysis (Supporting Methods).
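The thresholding and testing steps described above can be illustrated with a minimal sketch. It is not the pipeline used in the study: the function names are hypothetical, the use of the population standard deviation is an assumption, and the 2×3 exact test is written here as the Freeman-Halton extension of Fisher's exact test (summing the probabilities of all tables, with the observed margins, that are no more likely than the observed table).

```python
from math import comb
from statistics import mean, pstdev


def discretize_states(log2_ratios, n_sd=2.0):
    """Label each segment value low/normal/high using mean +/- n_sd * SD.

    Assumption: the SD is the population SD of the values supplied.
    """
    mu, sd = mean(log2_ratios), pstdev(log2_ratios)
    lower, upper = mu - n_sd * sd, mu + n_sd * sd
    return ["low" if v < lower else "high" if v > upper else "normal"
            for v in log2_ratios]


def fisher_exact_2x3(table):
    """Two-sided exact P for a 2x3 table (rows: cases, controls;
    columns: low, normal, high copy number)."""
    (a, b, c), (d, e, f) = table
    r1 = a + b + c                      # number of cases
    c1, c2, c3 = a + d, b + e, c + f    # column totals
    n = c1 + c2 + c3
    # Table probabilities share the denominator comb(n, r1), so integer
    # numerators can be compared exactly.
    observed = comb(c1, a) * comb(c2, b) * comb(c3, c)
    total = 0
    for x in range(min(r1, c1) + 1):
        for y in range(min(r1 - x, c2) + 1):
            z = r1 - x - y
            if z > c3:
                continue
            weight = comb(c1, x) * comb(c2, y) * comb(c3, z)
            if weight <= observed:
                total += weight
    return total / comb(n, r1)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni adjustment."""
    return alpha / n_tests


# Stage 2 threshold for the 2,318 segments carried forward from Stage 1.
stage2_threshold = bonferroni_threshold(0.05, 2318)
```

Comparing integer numerators rather than floating-point probabilities avoids tolerance issues when deciding which tables are "at least as extreme" as the observed one.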
Because our study population contains only 86 LC patients, we performed a Fisher's exact test on combined Stage 1 and Stage 2 CNV data from LC patients and healthy Korean individuals to identify copy number variants acting as risk factors for cirrhosis. To be considered significant, the resulting P had to be <0.05 after Bonferroni adjustment for 422,062 comparisons. Analysis aimed at identifying CNV that distinguishes HCC from LC was likewise performed on combined Stage 1 and Stage 2 data. The distribution of high, normal, and low copy number was examined at 208,761 nonoverlapping segments identified through CBS analysis of the 386 HCC and 86 LC individuals.
Genotype calls were generated with the Affymetrix Power Tools apt-probeset-genotype program using default parameters. Files were analyzed in two batches (Stages 1 and 2) to ensure accurate normalization. Noninformative markers, markers with a minor allele frequency below 5% in controls, SNPs for which <95% of samples have a quality score ≥ 90, and SNPs not in Hardy-Weinberg equilibrium in controls (P < 0.001) were excluded from further analysis. To test the association of individual SNPs with HCC, cases and controls were divided into training (Stage 1) and testing (Stage 2) sets as described above (Supporting Fig. S2). Single SNP association analysis was performed with PLINK,13 using a logistic model. The 5,622 SNPs that met a significance threshold of P < 0.01 in the Stage 1 discovery set were subjected to a Cochran-Armitage trend test using data from the Stage 2 population. The significance threshold for the trend test (8.89 × 10−6) was based on a correction for 5,622 comparisons. For cirrhosis, SNP analysis was performed using all LC cases and all controls. Similarly, all HCC and LC cases were used in single SNP analysis aimed at identifying variants that distinguish the two disease states. Linkage disequilibrium (LD) among individual markers was calculated for each chromosome using a C program that implements the LDSelect algorithm.14 SNPs with an r2 correlation ≥0.8 were considered to be in linkage disequilibrium.
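For readers unfamiliar with the Stage 2 test, the following is a minimal sketch of a Cochran-Armitage trend test on genotype counts, together with the Bonferroni threshold for 5,622 comparisons. It is an illustration only and not the PLINK implementation: the function name, the additive weights (0, 1, 2), and the normal approximation for the two-sided P are assumptions.

```python
from math import erfc, sqrt


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Trend test on genotype counts ordered (AA, AB, BB).

    Returns (z, two-sided P) using the normal approximation.
    """
    n_cases, n_controls = sum(cases), sum(controls)
    n_total = n_cases + n_controls
    cols = [r + s for r, s in zip(cases, controls)]
    # Test statistic: weighted difference between case and control counts.
    t_stat = sum(w * (r * n_controls - s * n_cases)
                 for w, r, s in zip(weights, cases, controls))
    # Variance of the statistic under the null hypothesis of no trend.
    var = sum(w * w * c * (n_total - c) for w, c in zip(weights, cols))
    var -= 2 * sum(weights[i] * weights[j] * cols[i] * cols[j]
                   for i in range(len(cols))
                   for j in range(i + 1, len(cols)))
    var *= n_cases * n_controls / n_total
    z = t_stat / sqrt(var)
    return z, erfc(abs(z) / sqrt(2))


# Bonferroni threshold for the 5,622 SNPs carried into Stage 2.
trend_threshold = 0.05 / 5622

# Hypothetical genotype counts, for illustration only.
z, p = cochran_armitage_trend(cases=(10, 40, 50), controls=(50, 40, 10))
significant = p < trend_threshold
```

A positive z indicates that the allele given the highest weight is more frequent in cases than in controls.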
The 1,000 SNPs most strongly associated with disease in the single marker association analysis were selected from Stage 1 and Stage 2. Regions of significance were defined by identifying additional SNPs in LD with these markers. The 1,000 SNPs of interest were then assigned to National Cancer Institute (NCI)-curated pathways (http://pid.nci.nih.gov) on the basis of their LD to genes in these pathways. The 1,000 SNPs were then evaluated for statistically significant overrepresentation in pathways using Fisher's hypergeometric density function.15 This test determines the likelihood of the observed number of associations (e.g., seven SNPs observed within the antigen processing pathway) from a finite population (18,504 total SNPs assigned to pathways, among which there are 16 total SNPs within the antigen processing pathway) in a defined number of draws without replacement (1,000 SNPs of interest).
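The enrichment calculation can be sketched as follows, using the counts given in the example above (18,504 pathway-assigned SNPs, 16 in the antigen processing pathway, 1,000 SNPs of interest, 7 observed). The function names are hypothetical, and reporting the upper tail P(X ≥ k) in addition to the point probability is an assumption about how the density function was applied.

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """P(X = k): exactly k pathway SNPs among the SNPs of interest."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))


def hypergeom_tail(k, population, successes, draws):
    """P(X >= k): at least k pathway SNPs among the SNPs of interest."""
    upper = min(successes, draws)
    # Sum integer numerators first so the final division is done once.
    numerator = sum(comb(successes, i) * comb(population - successes, draws - i)
                    for i in range(k, upper + 1))
    return numerator / comb(population, draws)


# Counts from the worked example in the text.
p_point = hypergeom_pmf(7, population=18504, successes=16, draws=1000)
p_tail = hypergeom_tail(7, population=18504, successes=16, draws=1000)
```

Python integers have arbitrary precision, so the large binomial coefficients are computed exactly and only the final ratio is converted to floating point.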
Real-Time Quantitative PCR Assay.
TaqMan real-time PCR assays (Applied Biosystems) were used to confirm the SNP6.0 CNV results for T-cell receptor alpha complex (TRA@) and T-cell receptor gamma complex (TRG@). Details of the assay are in Supporting Table S2. Copy number determination was performed using the standard curve method of absolute quantitation with normalization to albumin (ALB)16 as an internal reference. Standard curves were generated from CEPH controls, B-cell-derived lymphoblastoid cell lines that do not undergo rearrangement at the TCR loci, and thus are diploid for ALB, TRA@, and TRG@.
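The standard curve calculation can be sketched as below. This is a minimal illustration under stated assumptions, not the assay protocol: the function names and all numeric values are hypothetical, each standard curve is assumed to be a least-squares line of Ct against log10 of input quantity, and the copy number is assumed to be twice the target/ALB quantity ratio because the standards are diploid at both loci.

```python
from math import log10


def fit_standard_curve(quantities, cts):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [log10(q) for q in quantities]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(cts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def quantity_from_ct(ct, curve):
    """Invert the standard curve to estimate input quantity from Ct."""
    slope, intercept = curve
    return 10 ** ((ct - intercept) / slope)


def copy_number(ct_target, ct_alb, target_curve, alb_curve, ref_copies=2.0):
    """Target copy number normalized to the ALB internal reference."""
    ratio = (quantity_from_ct(ct_target, target_curve)
             / quantity_from_ct(ct_alb, alb_curve))
    return ref_copies * ratio


# Hypothetical dilution series of diploid standard DNA (arbitrary units).
standard_quantities = [1.0, 10.0, 100.0]
standard_cts = [30.0, 26.68, 23.36]
curve = fit_standard_curve(standard_quantities, standard_cts)

# A sample whose target signal matches ALB is estimated as diploid.
diploid_estimate = copy_number(26.68, 26.68, curve, curve)
```

In this scheme, a sample in which part of the lymphocyte population has deleted the target region yields an estimate below 2.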
Validation of SNP Genotypes.
The MHC class II region contains clusters of homologous genes. To verify that the SNP6.0 genotype calls for rs2647073 and rs3997872, SNPs showing the highest association to HCC, were not experimental artifacts, we genotyped these markers using an independent genotyping methodology, the TaqMan assay. TaqMan results were in complete agreement with the SNP6.0 genotypes.
Comparison of CNV in HCC patients and Korean controls reveals that a number of genomic regions showed strong association with disease outcome. We identified eight loci where CNV is significantly associated with HCC. Six of these appear to be germline CNVs. The other two, however, involve T-cell receptor loci, which are known to undergo recombination in peripheral blood lymphocytes, the source of DNA for our study. Of the six loci showing germline CNV, the one exhibiting the strongest association with HCC is a small region of chromosome 1p36.33 that contains no known or predicted genes. In this case, low copy number correlates with increased risk for both HCC (unadjusted P = 5.94 × 10−16 for Stage 1, P = 1.11 × 10−10 for Stage 2; Table 1) and LC (unadjusted P = 6.03 × 10−9 for combined Stage 1 and Stage 2; Table 2). The five other regions for which CNV is associated with HCC contain the genes KNG1 (3q27.3); C4orf29 and LARP2 (4q28.2); ALDH7A1, PHAX, C5orf48, and LMNB1 (5q23.2); SRPK2 and PUS7 (7q22.2); and TMPO (12q23.1). Low copy number at all five of these loci is more frequent in controls than HCC patients (Table 1). We observed no statistically significant association between CNV at these five loci and LC (Table 2). Additionally, none of these loci show significant differences between LC and HCC.
Table 1. Copy Number Variation Associated with HCC
OR.Fisher (95% CI)
P-value < 0.05 after Bonferroni adjustment for 2,318 comparisons.
P-value < 2.6x10−7 after Bonferroni adjustment for 2,318 comparisons.
Among the loci showing association of CNV with HCC, the strongest association is seen at the TRG@ and TRA@. In both cases low copy number is more frequent in controls than cases. In HCC versus controls, TRG@ shows an unadjusted P of 3.16 × 10−21 in the Stage 1 training set and P = 1.85 × 10−28 in the Stage 2 testing set; TRA@ has an unadjusted P = 1.94 × 10−16 in Stage 1 and P = 6.24 × 10−28 in Stage 2 (Table 1). We validated these findings using an independent platform by performing a TaqMan assay (t test P = 2.86 × 10−18 for TRA@; P = 3.56 × 10−26 for TRG@ for combined Stage 1 and Stage 2 samples; Supporting Table S9). CNV at the TRG@ and TRA@ loci also differs significantly between control and LC individuals (unadjusted P of 5.66 × 10−12 and 3.17 × 10−13, respectively, in combined Stage 1 and Stage 2 samples; Table 2). As is seen in HCC, low copy number is more frequent in control than LC individuals.
To confirm our proposal that the observed CNV at TRA@ and TRG@ reflects somatic genomic rearrangement at these loci that occurs in normal T lymphocytes, we inspected publicly accessible CNV data at these T-cell receptor loci in B cells. Because B cells do not exhibit TCR rearrangement, they should be diploid at the TRA@ and TRG@ loci. As expected, neither locus shows CNV in publicly accessible HapMap genotype data, which were generated using DNA isolated from B-cell lymphoblastoid cell lines established at the Centre d'Etude du Polymorphisme Humain (CEPH).17
We observe no significant association between CNV at the T-cell receptor loci and hepatitis virus status in the cases where viral status is known in the current study population (Supporting Table S4). Likewise, the HBV-positive Chinese control individuals were genotyped using the TaqMan assay for the TRA@ locus. TRA@ copy number was observed to be similar in HBV-positive Chinese individuals and our Korean controls (t test P = 0.477), but differed significantly between the HBV-positive Chinese individuals and our HBV-positive Korean cases (P = 6.572 × 10−13) (Supporting Table S5). Hence, we conclude that hepatitis virus infection status, per se, does not account for observed CNV differences between our cases and controls.
Single SNP Analysis.
In our investigation of the association of genomic variation with disease, we also examined individual SNPs in HCC patients and healthy Korean controls. The set of SNPs most strongly correlated with HCC by a trend test was enriched for polymorphisms in genes involved in antigen presentation. Three of the eight variants with the highest association to liver cancer (rs9267673, rs2647073, and rs3997872) lie in the MHC class II locus (Table 3). None of the three variants is in LD with either of the others. The variant rs9267673 is located adjacent to the gene C2. rs2647073 is in LD with SNPs in a set of genes that includes HLA-DRB1, HLA-DRB6, HLA-DRB5, and HLA-DRA. The SNP rs3997872, on the other hand, is in LD with SNPs in the HLA-DQA1, HLA-DQB1, HLA-DQA2, and HLA-DQB2 loci. All three SNPs are independently associated with HCC, showing neither an additive nor a multiplicative effect.
Table 3. Top 10 SNPs Associated with Hepatocellular Carcinoma (HCC) in the Replication Study
Distance to Gene
OR (95% CI)
The SNP is in LD with the upstream genes including HLA-DRB1.
The SNP is in LD with the genes downstream of HLA-DRB1.
Interestingly, in addition to their association with HCC, two of the three SNPs (rs9267673 and rs2647073) show association with LC (P = 0.0052 and P = 0.0007, respectively). In contrast, rs3997872 is only weakly associated with LC (P = 0.0408) (Supporting Table S3).
Comparison of SNP allele frequencies in HCC and LC patients identified two variants that distinguish liver cancer from cirrhosis (Table 4). The first, rs2880301, is located within the TPTE2 gene; the second, rs2551677, lies in a gene-poor region of 2q14.1. Both polymorphisms are distinct from those identified in the comparison of HCC patients and Korean controls.
Table 4. SNPs That Distinguish Hepatocellular Carcinoma (HCC) from Liver Cirrhosis (LC)
Distance to Gene
Minor Allele Frequency
OR (95% CI)
We also examined HCC individuals to determine whether risk alleles at SNPs associated with cancer (Table 3) correlate with hepatitis virus infection status. All eight variants show an adjusted P > 0.11 for association with HBV and an adjusted P > 0.48 for association with HCV (Supporting Table S6). Thus, viral infection status does not account for the observed association between SNPs in LD with immune response genes and liver cancer. Finally, we observe no significant association between HCC and SNPs in HLA-DP, which has been implicated in HBV susceptibility in Asian populations.18
We next evaluated whether multiple SNPs in a common biological network, each with a modest individual effect, were associated with HCC. In order to reduce complexity and statistical noise, we first selected the 1,000 most significant SNPs from Stage 1 and Stage 2 and assigned them to biological pathways based on their linkage disequilibrium to genes in the NCI Protein Interaction Database. We then tested whether any biological pathways were overrepresented in this set of 1,000 SNPs. The results, summarized in Table 5, show that “antigen processing and presentation” is the pathway most strongly associated with HCC in combined Stage 1 and Stage 2 data, with an unadjusted P of 1 × 10−11. We next examined the relationship between SNPs in antigen processing loci and CNV at the T-cell receptor loci. We found that multiple SNPs at the HLA-DQB2 locus were associated with CNVs at the TCR loci. Allelic variants of rs9276427, rs28420297, rs9276429, and rs9276490 are correlated with CNVs at TCR-gamma, all with P below 5 × 10−4.
Table 5. Biological Pathways Identified by Gene-Enrichment Analysis of Significant SNPs
We performed a multidimensional genomic analysis of HCC and LC, examining the association of CNVs, individual SNPs, and genetic pathways to liver disease. Our GWAS, the first to focus on HCC, reveals that both constitutional genetic variations and somatic genomic events behave as risk factors for HCC.
HCC is frequently preceded by cirrhosis. Because only a subset of LC patients develop HCC, it is of great interest to identify factors that affect the transition from LC to cancer. We identified two SNPs whose allele frequencies differ significantly between HCC and LC (Table 4). The first is located in 2q14.1, ≈175 kb from the nearest gene. The second variant lies within an intron of TPTE2, which encodes a homolog of the PTEN tumor suppressor protein.19 Our study is the first to suggest TPTE2 is involved in carcinogenesis.
Our analysis of HCC patients and healthy individuals identified six loci where inherited CNV is strongly associated with HCC (Table 1). Several of these have functions plausibly related to the etiology of HCC. SRPK2 encodes a product reported to phosphorylate the HBV core protein.20 Work by Zheng et al.21 suggests that SRPK2 can inhibit HBV replication.
Two loci where CNV is associated with liver cancer may play roles in tumorigenesis. TMPO encodes a protein that regulates the subnuclear localization of Rb. Knockdown of TMPO in fibroblasts disrupts cell cycle progression22, 23; elevated expression of the gene product has been observed in a variety of primary tumors.24 Consistent with its apparent role in promoting tumor formation, low TMPO copy number is associated with reduced HCC risk. In contrast to TMPO, increased copy number in a small region of 1p36.33 is associated with reduced HCC risk. Deletions at 1p36 have been reported in a wide variety of cancers.25-28 Although the 1p36.33 CNV region contains no known or predicted genes, the region does show homology to the mitochondrial genome.29 We are undertaking further analyses to determine whether the observed 1p36.33 CNV reflects variation in mitochondrial or chromosomal DNA.
The most striking outcome of our analysis of SNPs and CNVs is that germline variation may modulate somatic immune events that drive HCC susceptibility. First, we found that reduced copy number at the KNG1 locus, which encodes a protein that promotes T-cell senescence,30 is more frequent in healthy individuals than in liver cancer patients.
Further reinforcing the role of the immune system, individual SNP analyses reveal that the MHC class II locus contains three variants (rs9267673, rs2647073, and rs3997872) strongly associated with HCC. MHC class II molecules present antigen to CD4+ (helper) T cells.31 The three SNPs may be associated with altered MHC class II proteins that result in an ineffective T-cell response. Interestingly, rs2647073 lies 3.4 kb from rs660895, an SNP recently identified as a risk factor for the autoimmune liver disease biliary cirrhosis.32 Analysis of SNP allele distributions in pathways further reinforces this observation. In multiple SNP analysis, “antigen processing and presentation” emerged as the pathway with the strongest association with HCC. Among the SNPs in this pathway, multiple variants at the HLA-DQB2 locus were observed to be associated with CNVs at the TCR loci.
Analysis of copy number variation at TCR gene complexes supports the findings from the SNP analyses. Healthy individuals, on average, have lower copy number at the T-cell receptor loci TRA@ and TRG@ than do persons with HCC (Fig. 1). T-cell maturation involves TCR gene rearrangements that eliminate large portions of the T-cell receptor loci. Thus, successful T-cell receptor rearrangements appear to occur less frequently in cancer patients. Because TCR CNV is absent in DNA samples derived from liver tissue or immortalized B cells, the observed findings are attributable to somatic events occurring in T lymphocytes. CNV patterns at TRA@ suggest that rearrangement events generate functional alpha chain more frequently than delta chain. Low copy number segments observed in individual samples frequently encompass the TCR delta constant region, but rarely include the TCR alpha constant region (Fig. 2).
Support for the idea that altered T-cell activation contributes directly to carcinogenesis in the liver, rather than simply being a systemic reaction to cancer, comes from the strong association we see between CNV at the T-cell receptor loci and liver cirrhosis, a risk factor for and precursor to HCC (Table 2). Two of the three MHC class II locus SNPs whose genotypes correlate with HCC, rs9267673 and rs2647073, also exhibited strong association with LC (Table 3; Supporting Table S4).
Although the role of the immune system in constitutional susceptibility to HCC is new, the involvement of the immune system in HCC carcinogenesis has been previously suggested in clinical studies and research involving model organisms. Increased activity of helper T cells, which promote inflammation, is associated with HCC.33 Conversely, activation and proliferation of cytotoxic T lymphocytes is suppressed in individuals with HCC.34, 35 Further, chronic inflammation has been implicated in the development of liver cancer in both animal models and in humans.36-38
Our work complements and extends these findings by providing a genetic basis for the clinical observations and extending the findings to HCC susceptibility. Our results indicate that germline polymorphisms at the MHC class II locus may affect the generation and proliferation of T cells, with particular rearrangement patterns at TCR loci. From this observation, we propose that the T-cell repertoire of each individual plays a critical role in liver cancer susceptibility and that biological processes affecting T-cell maturation or immune surveillance may represent important etiologic mechanisms for the development of HCC in humans.
With additional validation, the findings of this study may have practical clinical benefit. Using DNA obtained from peripheral blood, it is possible to assess the status of the germline polymorphisms at the MHC class II loci. Such an assay may allow identification of individuals at increased risk of HCC for more intensive follow-up and monitoring. Similarly, TCR copy number status can be assessed using peripheral blood and an inexpensive TaqMan assay. With validation, this simple test could serve as a noninvasive screen for HCC. Ongoing work will focus on the development of a sensitive and accurate HCC classifier based on CNV loci identified in our study.
We thank Drs. Dinah Singer, Jung-Hyun Park, Katherine McGlynn, and Lalage Wakefield for critical reading of the article. We thank Dr. Barbara Dunn for insightful and valuable comments on the article. We thank Gretchen Carpintero, Grace Yanagawa, and Dhivya Jayaraman for technical assistance. Peripheral blood samples from HBV(+) Chinese individuals enrolled in the Haimen City Anti-epidemic Study were kindly provided by Drs. W. Thomas London, Alison Evans, Gang Chen, Wen-Yao Lin, and Fu-Min Shen.