Nucleotide excision repair genes and risk of lung cancer among San Francisco Bay Area Latinos and African Americans



Few studies on the association between nucleotide excision repair (NER) variants and lung cancer risk have included Latinos and African Americans. We examined variants in 6 NER genes (ERCC2, ERCC4, ERCC5, LIG1, RAD23B and XPC) in association with primary lung cancer risk among 113 Latino and 255 African American subjects newly diagnosed with primary lung cancer from 1998 to 2003 in the San Francisco Bay Area and 579 healthy controls (299 Latinos and 280 African Americans). Individual single nucleotide polymorphism and haplotype analyses, multifactor dimensionality reduction (MDR) and principal components analysis (PCA) were performed to assess the association between 6 genes in the NER pathway and lung cancer risk. Among Latinos, ERCC2 haplotype CGA (rs238406, rs11878644, rs6966) was associated with reduced lung cancer risk [odds ratio (OR) = 0.65; 95% confidence interval (CI): 0.44–0.97], especially among nonsmokers (OR = 0.29; 95% CI: 0.12–0.67). From MDR analysis, in Latinos, smoking and 3 SNPs (ERCC2 rs171140, ERCC5 rs17655 and LIG1 rs20581) together had a prediction accuracy of 67.4% (p = 0.001) for lung cancer. Among African Americans, His/His genotype of ERCC5 His1104Asp (rs17655) was associated with increased lung cancer risk (OR = 1.78; 95% CI: 1.09–2.91), and LIG1 haplotype GGGAA (rs20581, rs156641, rs3730931, rs20579 and rs439132) was associated with reduced lung cancer risk (OR = 0.61; 95% CI: 0.42–0.88). Our study suggests that different elements of the NER pathway may be important in different ethnic groups, as a result of different linkage relationships, genetic backgrounds and/or exposure histories. © 2008 Wiley-Liss, Inc.

Nucleotide excision repair (NER) has been well described and is 1 of the 3 DNA repair pathways that cells use to repair DNA base damage.1, 2 Despite numerous publications on the association between several NER genetic polymorphisms and lung cancer risk,3–41 only 3 studies included African Americans7, 8, 31 and only 2 included Latinos.7, 31 Although 80–90% of lung cancer is attributable to smoking,42 smoking patterns may not fully explain the difference in lung cancer incidence, particularly among African Americans,43, 44 who have the highest lung cancer rates in the United States.45 This suggests that ethnic differences in the incidence rates of lung cancer may be partially explained by inherited variations among different ethnic/racial groups. Therefore, this study examines the association between ERCC2, ERCC4, ERCC5, LIG1, RAD23B and XPC and lung cancer risk in 2 understudied populations: African Americans and Latinos (who have the lowest lung cancer rates in the United States).45 We used logistic regression of individual candidate SNPs and haplotypes as well as principal components analysis (PCA) and multifactor dimensionality reduction (MDR) to thoroughly explore genetic associations and gene-environment interactions with lung cancer risk. Moreover, to control for potential population stratification in these admixed populations,46 all analyses in this study were adjusted for individual genetic ancestry determined by a panel of 184 ancestry informative markers.

Material and methods

Study subjects

Cases were identified through the Northern California Cancer Center's rapid case ascertainment program and included San Francisco Bay Area residents newly diagnosed with primary lung cancer between September 1998 and March 2003. Subjects' treating physicians were sent a letter asking whether subjects had any contraindications to participate in the study. If no contraindications were indicated by the physicians, subjects were sent a letter describing the purpose of the study and a postcard to return if they did not want to participate. Subjects who did not refuse participation were telephoned for a short interview to obtain information on ethnicity, prediagnostic smoking history, occupational history and dietary habits. Self-identified Latinos or African Americans were individually asked to participate in a more detailed in-person interview and to donate blood or buccal specimens.

Recruitment of control subjects has been described in detail previously.47 Briefly, control subjects were recruited through 3 sources: random-digit dialing, Health Care Financing Administration records and community-based recruitment (e.g. health fair, churches and senior centers). Controls were frequency matched to cases on age, gender and race/ethnicity (Latino or African American) with a control to case ratio of ∼2 to 1. Control subjects completed in-person interviews and donated a blood and/or buccal specimen.

The study was approved by the Committee on Human Research of the University of California, San Francisco and by the Institutional Review Boards of all collaborating institutions.


NER pathway genes and SNP selection

This analysis includes 17 single nucleotide polymorphisms (SNPs) belonging to 6 NER genes (ERCC2, ERCC4, ERCC5, LIG1, RAD23B and XPC) and 1 SNP belonging to PPP1R13L, which forms a haplotype block with several SNPs of the ERCC2 gene but is not involved in NER (SNPs are listed in Supplemental Table I). SNPs were selected using a candidate gene approach and were drawn from multiple sources. A number of SNPs (rs13181, rs1052555, rs3916876 and rs238406) were identified in ERCC2 from the literature,48–50 and rs17655, rs1805329, rs1800067 and rs2228001 were selected for their potential influence on DNA repair pathways.51 The SNP500Cancer database52 was queried for SNPs appearing in candidate genes in the combined 102 individual SNP500 population with a minor allele frequency (MAF) >5%; SNPs rs1799787, rs20581, rs156641, rs3730931, rs20579 and rs439132 were selected in this manner. Finally, the HapMap database53 was used to generate haplotypes from candidate genes and their flanking 10,000 bp regions in Yoruba West Africans from Ibadan, Nigeria (YRI) and CEPH (Utah residents with ancestry from northern and western Europe) populations. Rs1799793, rs171140 and rs11878644 were identified as tag SNPs in the CEPH dataset, possessing an MAF >5%.

Ancestry informative markers

In addition to the SNPs of the NER genes, a panel of biallelic SNPs designed by coauthor M. Seldin was genotyped to account for potential population stratification among Latinos and African Americans, 2 admixed populations. European ancestral DNA was collected from 47 Caucasians of European descent who were healthy controls in an ongoing population-based cancer study in the San Francisco Bay Area.54 African ancestral DNA (N = 47) was provided by coauthor R. Kittles and was collected from 23 subjects from the Bini, a Niger-Congo group of Bantu speakers from Edo State, and 24 subjects from the Kanuri, a group of Nilo-Saharan speakers from the Lake Chad region of northern Nigeria. Amerindian ancestral DNA (N = 46) was provided by coauthor G. Silva and was collected from Mayans living in 2 villages, Bola De Oro and Cienega Grande, from Chimaltenango. One hundred eighty-four unlinked autosomal SNPs with large differences in allele frequencies between ancestral populations were identified as ancestry informative markers (mean difference in allele frequencies ranged from 0.43 to 0.49). Genetic ancestry (percent European, Amerindian and African ancestry) was estimated using these 184 ancestry informative markers and a maximum likelihood-based program written in R, developed specifically for this project and based on the methods described by Chakraborty et al.55 and Hanis et al.56
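The idea behind a maximum likelihood ancestry estimate can be illustrated with a minimal sketch. It is not the authors' R program: it assumes only 2 ancestral populations (the study used 3), independent markers, and a simple grid search; the function name, arguments and example frequencies are hypothetical.

```python
import math


def estimate_admixture(genotypes, freq_pop1, freq_pop2, grid_size=100):
    """Grid-search maximum likelihood estimate of the ancestry
    proportion contributed by ancestral population 1.

    genotypes : copies (0, 1 or 2) of the reference allele at each marker
    freq_pop1 : reference-allele frequency of each marker in population 1
    freq_pop2 : reference-allele frequency of each marker in population 2
    """
    best_m, best_loglik = 0.0, -math.inf
    for step in range(grid_size + 1):
        m = step / grid_size
        loglik = 0.0
        for g, p1, p2 in zip(genotypes, freq_pop1, freq_pop2):
            # Expected allele frequency for an individual with ancestry m
            q = m * p1 + (1.0 - m) * p2
            q = min(max(q, 1e-9), 1.0 - 1e-9)  # guard against log(0)
            # Binomial log-likelihood of g copies out of 2
            # (the constant binomial coefficient is omitted)
            loglik += g * math.log(q) + (2 - g) * math.log(1.0 - q)
        if loglik > best_loglik:
            best_m, best_loglik = m, loglik
    return best_m


# Hypothetical markers with a large frequency difference between populations
print(estimate_admixture([1, 1, 1, 1], [0.9] * 4, [0.1] * 4))
```

Markers with large allele frequency differences between ancestral populations make the likelihood surface steep, which is why such markers are informative for ancestry.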

Genotyping platform

Genotyping was performed on an Illumina BeadStation 500G Golden Gate genotyping platform with a custom panel of 384 candidate and ancestry informative SNPs and unamplified DNA extracted from blood. For 6 subjects with insufficient DNA from blood, genotypes from whole genome amplified blood or buccal DNA samples are included in the dataset. Whole genome amplification (WGA) was performed as previously described.57 Genotype reproducibility was verified with duplicates of unamplified DNA (N = 31) and WGA/genomic DNA pairs. Unamplified duplicates averaged 99.99% reproducibility (range: 99.86–100%). Depending on whether WGA DNA was amplified from blood (N = 18 pairs tested) or buccal-derived DNA (N = 28 pairs tested), WGA/genomic pairs averaged 99.39% (98.93–99.60%) and 98.49% (96.11–99.73%) genotype reproducibility, respectively.

All Latino (n = 131) and African American (n = 267) cases were genotyped along with all available Latino controls (n = 308). Because of budget constraints, we selected a random sample (n = 290) of African American controls for genotyping. For this analysis, we excluded subjects who reported belonging to other ancestral/ethnicity groups in addition to Latino or African American. The final sample of this study consists of 412 Latino subjects (113 cases and 299 controls) and 535 African American subjects (255 cases and 280 controls).

Statistical analysis

All analyses were conducted separately within the 2 ethnic groups, Latino and African American. We calculated allele frequencies for all NER SNPs and excluded from further analysis those with an MAF less than 5%. Tests of Hardy-Weinberg equilibrium were performed for each SNP using the exact test in the Proc Allele procedure of SAS Genetics (Cary, NC), and SNPs failing the Hardy-Weinberg test with a false discovery rate (FDR) <0.05 (after adjustment for multiple comparisons) were excluded.
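These quality-control steps can be illustrated with a short sketch. It makes two substitutions that should be read as assumptions: a 1-degree-of-freedom chi-square test stands in for the exact test of Proc Allele, and the Benjamini-Hochberg procedure stands in for the FDR adjustment, whose exact method is not stated here. The function names are hypothetical; the genotype counts are those of rs238406 among Latino controls in Table II.

```python
import math


def minor_allele_frequency(n_hom_a, n_het, n_hom_b):
    """Minor allele frequency from the three genotype counts."""
    n = n_hom_a + n_het + n_hom_b
    freq_b = (n_het + 2 * n_hom_b) / (2 * n)
    return min(freq_b, 1.0 - freq_b)


def hwe_chi_square_p(n_hom_a, n_het, n_hom_b):
    """Chi-square (1 df) test of Hardy-Weinberg equilibrium.
    This is an approximation, not the exact test used in the study."""
    n = n_hom_a + n_het + n_hom_b
    p = (2 * n_hom_a + n_het) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom_a, n_het, n_hom_b)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2.0))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# rs238406 among Latino controls: CC = 113, AC = 123, AA = 61
print(minor_allele_frequency(113, 123, 61))
print(hwe_chi_square_p(113, 123, 61))
```

An SNP would be dropped only if its adjusted value fell below 0.05, so a nominally small Hardy-Weinberg p-value can still be retained once multiple testing is taken into account.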

Further analyses were performed in the following order:

  1. Logistic regressions adjusted for age, gender and percent of European and Amerindian ancestry for single SNPs and haplotypes for genes (ERCC2 and LIG1) with more than 1 SNP included on the assay panel. We did not adjust for income or education as these 2 variables do not directly determine one's genotypes of NER genes and, therefore, do not satisfy the definition of a confounder of the relationship between NER genes and lung cancer. Although smoking is an established risk factor for lung cancer, we do not think it is a confounder of the association between NER genes and lung cancer, as smoking also does not directly determine one's genotype. Smoking may be indirectly associated with NER genes through race/ethnicity, but this has been accounted for in our analyses by stratification on race/ethnicity and adjustment for genetic ancestry. In addition, we did not observe any significant correlation between the number of variant alleles for any of the SNPs included in this study and smoking (in pack-years) among our control subjects (data not shown). However, we performed a sensitivity analysis adjusting for smoking, and the results were similar.
  2. To account for potential false positive results, we calculated the false positive report probability (FPRP) for the significant results (p < 0.05). FPRP is defined as the probability of no true association between a genetic variant and disease given a statistically significant finding; the magnitude of FPRP depends on the prior probability of the association between a genetic variant and the disease, and on the statistical power of the test.58 As previous studies suggested an association between NER pathways and lung cancer, we assigned moderate-high prior probabilities of 0.10–0.25 and considered FPRP <0.50 as a finding that warrants replication by future studies. As other investigators may have different views on the degree of prior probabilities, FPRPs were also calculated using prior probabilities of 0.01 and 0.001.
  3. Exploratory assessment of interactions between NER genes and smoking using logistic regression and MDR.
  4. Principal components analysis as an alternative method to haplotypes for capturing multi-SNP variation and for assessing gene-smoking interaction.
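The FPRP described in step 2 of the list above can be sketched in a few lines, assuming the standard formulation FPRP = α(1 − π) / [α(1 − π) + (1 − β)π], where α is the observed p-value, 1 − β the statistical power and π the prior probability. The function name is hypothetical, and the inputs in the example are the rounded power and p-value reported for ERCC2 block 3B in Table IV, so the output is expected to differ slightly from the published values.

```python
def fprp(alpha, power, prior):
    """False positive report probability.

    alpha : significance level, here the observed p-value
    power : statistical power (1 - beta) to detect the assumed odds ratio
    prior : prior probability that the association is real
    """
    false_positive = alpha * (1.0 - prior)  # P(significant and no association)
    true_positive = power * prior           # P(significant and true association)
    return false_positive / (false_positive + true_positive)


# ERCC2 block 3B: reported p = 0.04, power = 0.45 (rounded values)
for prior in (0.25, 0.1, 0.01, 0.001):
    print(prior, round(fprp(0.04, 0.45, prior), 2))
```

Because the prior enters both terms, the FPRP rises steeply as the prior falls, which is why the same finding can look noteworthy at a prior of 0.25 and unconvincing at 0.001.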

Single SNP and haplotype analysis

Unconditional logistic regression was performed with each individual SNP without assuming any mode of inheritance by including 2 index variables in the model (1 for heterozygous variant and 1 for homozygous variant genotype). In addition, tests for trend were performed using the log-additive model by coding the copies of minor alleles as 0, 1 and 2.
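The two genotype codings can be made concrete with a small sketch, together with an unadjusted odds ratio computed directly from genotype counts. The study's ORs come from logistic regression adjusted for age, sex and genetic ancestry, so the crude estimate below is only an approximation of them; the function names are hypothetical, and the counts are those of rs13181 (CC versus AA) among Latinos in Table II.

```python
import math


def code_genotype(minor_allele_copies):
    """Return (heterozygote indicator, homozygote indicator, additive code).
    The two indicators are used in the model with no assumed mode of
    inheritance; the additive code is used in the test for trend."""
    het = 1 if minor_allele_copies == 1 else 0
    hom = 1 if minor_allele_copies == 2 else 0
    return het, hom, minor_allele_copies


def crude_odds_ratio(cases_exposed, controls_exposed, cases_ref, controls_ref):
    """Unadjusted OR with a Woolf (log-scale) 95% confidence interval."""
    odds_ratio = (cases_exposed * controls_ref) / (controls_exposed * cases_ref)
    se = math.sqrt(1 / cases_exposed + 1 / controls_exposed
                   + 1 / cases_ref + 1 / controls_ref)
    lower = math.exp(math.log(odds_ratio) - 1.96 * se)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se)
    return odds_ratio, lower, upper


print(code_genotype(2))
# rs13181 in Latinos: CC 14 cases / 14 controls, AA 63 cases / 182 controls
print(crude_odds_ratio(14, 14, 63, 182))
```

Comparing the crude estimate with the adjusted OR reported in Table II gives a rough sense of how much the covariate adjustment changes the result.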

Haplotype blocks for 2 genes (ERCC2 and LIG1) with more than 1 SNP genotyped were determined by Haploview version 3.32,59 using the block definition described by Gabriel et al.60 Haplotype analysis was then performed with each haplotype block. Haplotypes were estimated from SNPs belonging to the same haplotype block by the expectation-maximization (EM) algorithm using the SAS macro HAPPY written by Kraft and Chen. The HAPPY SAS macro uses SAS PROC HAPLOTYPE with the “stepem” option, based on the haplotype estimation program SNPHAP by David Clayton.61 A study by Adkins compared 4 methods of estimating haplotypes, including Phase and SNPHAP, and showed that all 4 methods performed equally well.62 Haplotypes with a frequency of less than 5% were combined into 1 group for analysis. Haplotype trend regressions were performed to estimate the odds ratio (OR) associated with a one-copy increment of a specific haplotype, using the most common haplotype as the reference group.63, 64 To account for the uncertain phases of haplotypes, the probabilities of having different haplotype combinations were incorporated as weights in the regression model. Global tests for the association between haplotypes of a haplotype block and lung cancer were performed by comparing the full model with the haplotype variables to the submodel without the haplotype variables using the log-likelihood ratio test.
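The EM step can be illustrated for the simplest case of 2 biallelic SNPs, where only double heterozygotes have ambiguous phase. This is a minimal sketch under that simplification, not the stepwise EM of PROC HAPLOTYPE or SNPHAP; the function name and the example counts are hypothetical.

```python
def em_two_snp_haplotypes(genotype_counts, n_iter=50):
    """Estimate haplotype frequencies for 2 biallelic SNPs by EM.

    genotype_counts maps (g1, g2) to a number of subjects, where g1 and g2
    are the copies (0, 1 or 2) of allele "1" at SNP 1 and SNP 2.
    Returns frequencies for haplotypes "00", "01", "10" and "11".
    """
    freqs = {"00": 0.25, "01": 0.25, "10": 0.25, "11": 0.25}
    total_haplotypes = 2 * sum(genotype_counts.values())
    for _ in range(n_iter):
        counts = dict.fromkeys(freqs, 0.0)
        for (g1, g2), n in genotype_counts.items():
            if g1 == 1 and g2 == 1:
                # E-step: a double heterozygote is either 00/11 or 01/10;
                # split it according to the current frequency estimates
                cis = freqs["00"] * freqs["11"]
                trans = freqs["01"] * freqs["10"]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                counts["00"] += n * w
                counts["11"] += n * w
                counts["01"] += n * (1.0 - w)
                counts["10"] += n * (1.0 - w)
            else:
                # Phase is known when at most one SNP is heterozygous
                hap_a = f"{g1 // 2}{g2 // 2}"
                hap_b = f"{(g1 + 1) // 2}{(g2 + 1) // 2}"
                counts[hap_a] += n
                counts[hap_b] += n
        # M-step: update frequencies from the expected haplotype counts
        freqs = {h: c / total_haplotypes for h, c in counts.items()}
    return freqs


# Hypothetical genotype counts for two SNPs in complete linkage disequilibrium
print(em_two_snp_haplotypes({(0, 0): 50, (1, 1): 40, (2, 2): 10}))
```

The per-subject weights computed in the E-step are the same kind of phase probabilities that the text describes as being carried into the regression model as weights.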

Interaction between individual SNPs or haplotypes and smoking

To assess gene-smoking interaction, analyses stratified by smoking status were performed with ERCC2 and LIG1 haplotypes or individual ERCC4, ERCC5, RAD23B and XPC SNPs with a main effect p ≤ 0.10. Tests for interaction were performed by including product terms between smoking status and SNPs or haplotypes in the unconditional logistic regression model. The p-value for interaction was obtained by a log-likelihood ratio test comparing the full model with product terms to the submodel without the product terms.

Interaction between multiple SNPs and smoking using multifactor dimensionality reduction analysis

Multifactor dimensionality reduction (MDR) analysis was performed to assess high-order smoking-SNP and SNP–SNP interactions. Subjects with missing data on at least 1 SNP were excluded from the MDR analysis (12 Latinos and 20 African Americans). A detailed description of MDR has been published previously.65, 66 Briefly, MDR is a nonparametric method which reduces n-dimensional data to a single dimensional variable with 2 levels (high vs. low risk). The MDR procedure performs exhaustive searches of all possible combinations of n genetic/environmental factors and the best combination of n-factors is the one with the highest prediction accuracy and the highest cross validation consistency. For this analysis, we allowed MDR to choose up to 4 variables among all qualified SNPs and smoking status (ever vs. never). We repeated the 10-fold cross validation 10 times using 10 different random seeds to reduce the probability of spurious findings due to chance division of the data. p-values were calculated by permutation testing with 1,000 permutations. The best combination of n factors was then included in the unconditional logistic regression model as a dichotomous predictor (high vs. low risk) to determine the associated OR while adjusting for age, sex and genetic ancestry.
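The core reduction step of MDR can be sketched as follows. This is a simplified illustration, not the MDR software used in the study: it labels each multifactor cell as high or low risk by comparing its case:control ratio with the overall ratio and scores combinations by balanced accuracy on the same data, omitting the 10-fold cross-validation and permutation testing described above. All function and variable names are hypothetical, and the example data are synthetic.

```python
from collections import defaultdict
from itertools import combinations


def mdr_high_risk_cells(rows, labels, factors):
    """Return the set of factor-level combinations labeled high risk."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for row, y in zip(rows, labels):
        counts[tuple(row[f] for f in factors)][y] += 1
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    threshold = n_cases / n_controls  # overall case:control ratio
    return {cell for cell, (ctrl, case) in counts.items()
            if case > threshold * ctrl}


def balanced_accuracy(rows, labels, factors, high_risk):
    """Mean of sensitivity and specificity of the high/low-risk variable."""
    tp = fn = tn = fp = 0
    for row, y in zip(rows, labels):
        predicted_high = tuple(row[f] for f in factors) in high_risk
        if y == 1:
            tp += predicted_high
            fn += not predicted_high
        else:
            fp += predicted_high
            tn += not predicted_high
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def best_combination(rows, labels, names, max_order=2):
    """Exhaustively search all combinations of up to max_order factors."""
    best_factors, best_accuracy = None, -1.0
    for k in range(1, max_order + 1):
        for factors in combinations(names, k):
            high_risk = mdr_high_risk_cells(rows, labels, factors)
            accuracy = balanced_accuracy(rows, labels, factors, high_risk)
            if accuracy > best_accuracy:
                best_factors, best_accuracy = factors, accuracy
    return best_factors, best_accuracy


# Synthetic data: disease status depends only on the interaction of a and b
rows = [{"a": a, "b": b, "c": c}
        for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [row["a"] ^ row["b"] for row in rows]
print(best_combination(rows, labels, ["a", "b", "c"]))
```

In the synthetic example neither factor has a main effect, yet the pair is perfectly predictive, which is the kind of pure interaction that single-SNP regression would miss.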

Principal components analysis

As an alternative to haplotype analysis, principal components analysis (PCA) was performed with ERCC2 and LIG1 SNPs using a method described by Gauderman et al.67 The PCA method captures the linkage-disequilibrium pattern within a gene but does not require one to estimate haplotypes with unknown phase.67 Simulations showed that PCA is as or more powerful than both genotype- and haplotype-based approaches.67 First, PCA was performed to generate principal components that capture the correlation structure between SNPs within a gene. Then, the principal components that explained at least 80% of the variance were modeled for their association with lung cancer status by logistic regression. The 80% cut-off was shown to have sufficient statistical power according to Gauderman et al.67 SNPs that are strongly correlated with the principal components that are significantly associated with lung cancer risk are thought to be the important SNPs (or linked to the important SNPs) for disease susceptibility. Tests for interaction between smoking and principal components generated from PCA were also performed.
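The 80% rule for retaining components can be shown with a small sketch. It assumes the eigenvalues of the SNP correlation matrix have already been computed by a PCA routine; the function name and the eigenvalues in the example are hypothetical and are not results from this study.

```python
def components_for_variance(eigenvalues, threshold=0.80):
    """Number of leading principal components needed to explain at least
    `threshold` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)


# Hypothetical eigenvalues for a gene with 4 correlated SNPs
print(components_for_variance([2.5, 1.0, 0.3, 0.2]))
```

When SNPs within a gene are in strong linkage disequilibrium, the first few eigenvalues dominate, so fewer terms enter the regression than in a model with one term per SNP.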


Results

Among Latinos, cases were more likely than controls to have ever smoked, had smoked more pack-years, had higher income, and had a higher mean percentage of European ancestry and a lower mean percentage of Amerindian ancestry (Table I). Among African Americans, cases were more likely than controls to have ever smoked, had smoked more pack-years and had fewer years of schooling, but the percentages of European and African genetic ancestry were very similar for cases and controls.

Table I. Demographic and Smoking Characteristics of the Study Participants, the San Francisco Bay Area Lung Cancer Study 1998–2003
| Characteristic | Latino American: Cases (n = 113) | Latino American: Controls (n = 299) | p-value | African American: Cases (n = 255) | African American: Controls (n = 280) | p-value |
|---|---|---|---|---|---|---|
| Mean ± SE | | | | | | |
| Age | 65.85 ± 1.10 | 66.30 ± 0.66 | 0.73 | 63.51 ± 0.68 | 61.81 ± 0.68 | 0.08 |
| Years of schooling | 10.07 ± 0.40 | 11.03 ± 0.28 | 0.06 | 12.32 ± 0.21 | 13.64 ± 0.19 | <0.0001 |
| Genetic ancestry¹ | | | | | | |
| % European | 60.24 ± 1.83 | 53.99 ± 0.98 | 0.0015 | 18.98 ± 0.87 | 20.09 ± 0.83 | 0.35 |
| % Amerindian | 32.78 ± 1.74 | 38.30 ± 0.95 | 0.0035 | 3.01 ± 0.23 | 2.49 ± 0.20 | 0.08 |
| % African | 6.99 ± 0.76 | 7.71 ± 0.43 | 0.39 | 78.01 ± 0.88 | 77.42 ± 0.81 | 0.62 |
| Pack-years smoked | 33.79 ± 3.39 | 9.25 ± 0.99 | <0.0001 | 33.44 ± 1.53 | 16.04 ± 1.36 | <0.0001 |
| Percentage (%) | | | | | | |
| Ever smoked | 71.7 | 47.2 | <0.0001 | 94.1 | 67.9 | <0.0001 |
| Household income ≥ $20,000² | 61.1 | 46.6 | 0.02 | 50.9 | 52.5 | 0.71 |

  ¹ Percentage of genetic ancestry was determined using 186 ancestry informative markers.
  ² Median income.

Single SNP analysis

One SNP (ERCC2 rs3916876) among Latinos and 3 SNPs (ERCC2 rs3916876, ERCC4 rs1800067 and RAD23B rs1805329) among African Americans were excluded from analysis due to MAF < 0.05 (Supplemental Table I). All SNPs were in Hardy-Weinberg equilibrium except for rs238406 (p = 0.01) among Latino controls and rs20581 (p = 0.03) among African American controls; however, the deviation of those 2 SNPs from Hardy-Weinberg equilibrium could be due to chance after accounting for multiple testing (FDR >0.05), and thus, they were kept in the analyses (FDRs for rs238406 among Latino controls and rs20581 among African American controls were 0.44 and 0.58, respectively).

Among Latinos (Table II), 2 of the 17 SNPs tested were significantly associated with risk of lung cancer (p < 0.05); these were ERCC2 rs13181 (Lys751Gln) and PPP1R13L rs6966, which forms a haplotype block with several ERCC2 SNPs. Among African Americans (Table II), 3 of 15 SNPs were significantly associated with risk of lung cancer; these were ERCC5 rs17655 (Asp1104His), and LIG1 rs20579 and rs439132. We performed sensitivity analyses with these significant SNPs adjusting for smoking as an additional covariate, and the results were either similar or more statistically significant (see footnote of Table II).

Table II. Analysis of Nucleotide Excision Repair SNPs Among Latinos and African Americans, the San Francisco Bay Area Lung Cancer Study 1998–2003
| SNPs | Latinos: Cases N (%) | Latinos: Controls N (%) | Latinos: OR¹ (95% CI) | Latinos: p-value | African Americans: Cases N (%) | African Americans: Controls N (%) | African Americans: OR¹ (95% CI) | African Americans: p-value |
|---|---|---|---|---|---|---|---|---|
| rs13181 (Lys751Gln) | | | | | | | | |
| AA (Lys/Lys) | 63 (55.8) | 182 (60.9) | Reference | | 143 (56.1) | 168 (60.0) | Reference | |
| AC (Lys/Gln) | 36 (31.9) | 103 (34.5) | 0.93 (0.57–1.52) | 0.78 | 100 (39.2) | 98 (35.0) | 1.23 (0.86–1.77) | 0.25 |
| CC (Gln/Gln) | 14 (12.4) | 14 (4.7) | 2.53 (1.12–5.70)² | 0.03 | 12 (4.7) | 14 (5.0) | 1.01 (0.45–2.28) | 0.98 |
| Trend | | | 1.28 (0.91–1.82) | 0.16 | | | 1.13 (0.84–1.51) | 0.41 |
| rs1052555 (Asp711Asp) | | | | | | | | |
| GG | 67 (59.3) | 199 (66.6) | Reference | | 191 (74.9) | 210 (75.0) | Reference | |
| AG | 38 (33.6) | 92 (30.8) | 1.12 (0.69–1.82) | 0.64 | 62 (24.3) | 66 (23.6) | 1.04 (0.69–1.57) | 0.84 |
| AA | 8 (7.1) | 8 (2.7) | 2.38 (0.84–6.72) | 0.10 | 2 (0.8) | 4 (1.4) | 0.58 (0.10–3.29) | 0.54 |
| Trend | | | 1.29 (0.88–1.90) | 0.19 | | | 0.99 (0.68–1.44) | 0.95 |
| GG | 68 (60.7) | 198 (66.4) | Reference | | 193 (76.9) | 213 (76.1) | Reference | |
| AG | 35 (31.3) | 90 (30.2) | 1.05 (0.64–1.71) | 0.86 | 57 (22.7) | 64 (22.9) | 0.98 (0.64–1.48) | 0.91 |
| AA | 9 (8.0) | 10 (3.4) | 2.28 (0.87–5.98) | 0.09 | 1 (0.4) | 3 (1.1) | 0.44 (0.05–4.29) | 0.48 |
| Trend | | | 1.26 (0.87–1.83) | 0.23 | | | 0.93 (0.63–1.38) | 0.72 |
| AA | 36 (31.9) | 85 (28.4) | Reference | | 185 (72.6) | 202 (72.1) | Reference | |
| AC | 42 (37.2) | 152 (50.8) | 0.62 (0.37–1.06) | 0.08 | 64 (25.1) | 74 (26.4) | 0.93 (0.63–1.39) | 0.73 |
| CC | 35 (31.0) | 62 (20.7) | 1.30 (0.73–2.32) | 0.38 | 6 (2.4) | 4 (1.4) | 1.77 (0.48–6.45) | 0.39 |
| Trend | | | 1.12 (0.83–1.52) | 0.46 | | | 1.02 (0.72–1.45) | 0.90 |
| rs1799793 (Asp312Asn) | | | | | | | | |
| GG (Asp/Asp) | 60 (55.6) | 192 (64.7) | Reference | | 186 (75.3) | 212 (76.5) | Reference | |
| AG (Asp/Asn) | 40 (37.0) | 93 (31.3) | 1.24 (0.77–2.02) | 0.37 | 58 (23.5) | 60 (21.7) | 1.08 (0.71–1.64) | 0.72 |
| AA (Asn/Asn) | 8 (7.4) | 12 (4.0) | 1.94 (0.75–5.01) | 0.17 | 3 (1.2) | 5 (1.8) | 0.72 (0.17–3.10) | 0.66 |
| Trend | | | 1.32 (0.91–1.91) | 0.15 | | | 1.02 (0.70–1.48) | 0.91 |
| rs238406 (Arg156Arg) | | | | | | | | |
| CC | 41 (36.6) | 113 (38.1) | Reference | | 190 (74.8) | 207 (74.2) | Reference | |
| AC | 37 (33.0) | 123 (41.4) | 0.74 (0.44–1.25) | 0.25 | 58 (22.8) | 68 (24.4) | 0.91 (0.60–1.37) | 0.64 |
| AA | 34 (30.4) | 61 (20.5) | 1.46 (0.83–2.55) | 0.19 | 6 (2.4) | 4 (1.4) | 1.75 (0.48–6.39) | 0.40 |
| Trend | | | 1.17 (0.88–1.57) | 0.28 | | | 1.01 (0.70–1.44) | 0.97 |
| AA | 40 (35.4) | 85 (28.4) | Reference | | 11 (4.3) | 10 (3.6) | 1.26 (0.51–3.07) | 0.62 |
| AG | 55 (48.7) | 153 (51.2) | 0.78 (0.48–1.28) | 0.33 | 80 (31.4) | 89 (31.8) | 1.02 (0.70–1.48) | 0.94 |
| GG | 18 (15.9) | 61 (20.4) | 0.67 (0.35–1.29) | 0.23 | 164 (64.3) | 181 (64.6) | Reference | |
| Trend | | | 0.81 (0.59–1.12) | 0.20 | | | 1.06 (0.78–1.44) | 0.73 |
| TT | 71 (62.8) | 145 (48.5) | Reference | | 37 (14.5) | 47 (16.9) | 1.04 (0.61–1.77) | 0.89 |
| AT | 37 (32.7) | 126 (42.1) | 0.65 (0.41–1.04) | 0.07 | 139 (54.5) | 127 (45.7) | 1.42 (0.97–2.09) | 0.07 |
| AA | 5 (4.4) | 28 (9.4) | 0.42 (0.15–1.13) | 0.09 | 79 (31.0) | 104 (37.4) | Reference | |
| Trend | | | 0.65 (0.45–0.94)² | 0.02 | | | 1.09 (0.84–1.41) | 0.52 |
| rs1800067 (Arg415Gln) | | | | | ³ | | | |
| GG (Arg/Arg) | 97 (85.8) | 267 (89.3) | Reference | | | | | |
| AG (Arg/Gln) | 16 (14.2) | 31 (10.4) | 1.47 (0.76–2.84) | 0.25 | | | | |
| AA (Gln/Gln) | 0 (0.0) | 1 (0.3) | | | | | | |
| Trend | | | 1.32 (0.70–2.50) | 0.39 | | | | |
| rs17655 (Asp1104His) | | | | | | | | |
| GG (Asp/Asp) | 60 (53.1) | 138 (46.2) | Reference | | 68 (26.7) | 93 (33.2) | Reference | |
| CG (His/Asp) | 44 (38.9) | 127 (42.5) | 0.81 (0.51–1.28) | 0.36 | 119 (46.6) | 138 (49.3) | 1.17 (0.78–1.74) | 0.45 |
| CC (His/His) | 9 (8.0) | 34 (11.4) | 0.64 (0.29–1.43) | 0.28 | 68 (26.7) | 49 (17.5) | 1.78 (1.09–2.91)² | 0.02 |
| Trend | | | 0.80 (0.57–1.13) | 0.20 | | | 1.32 (1.03–1.69)² | 0.03 |
| rs20581 (Asp802Asp) | | | | | | | | |
| GG | 38 (33.6) | 89 (29.7) | Reference | | 176 (69.0) | 199 (71.1) | Reference | |
| AG | 48 (42.4) | 151 (50.5) | 0.74 (0.45–1.24) | 0.25 | 73 (28.6) | 68 (24.3) | 1.27 (0.86–1.89) | 0.24 |
| AA | 27 (23.9) | 59 (19.7) | 1.08 (0.59–1.97) | 0.80 | 6 (2.4) | 13 (4.6) | 0.56 (0.21–1.52) | 0.25 |
| Trend | | | 1.01 (0.74–1.37) | 0.95 | | | 1.03 (0.75–1.42) | 0.85 |
| GG | 59 (52.2) | 143 (47.8) | Reference | | 189 (74.1) | 215 (76.8) | Reference | |
| AG | 43 (38.1) | 126 (42.1) | 0.80 (0.50–1.28) | 0.34 | 62 (24.3) | 60 (21.4) | 1.14 (0.75–1.72) | 0.54 |
| AA | 11 (9.7) | 30 (10.0) | 0.81 (0.38–1.75) | 0.60 | 4 (1.6) | 5 (1.8) | 0.95 (0.24–3.73) | 0.94 |
| Trend | | | 0.86 (0.62–1.21) | 0.39 | | | 1.10 (0.76–1.58) | 0.63 |
| AA | 79 (69.9) | 226 (75.6) | Reference | | 151 (59.5) | 158 (56.4) | Reference | |
| AG | 30 (26.6) | 67 (22.4) | 1.35 (0.81–2.25) | 0.25 | 92 (36.2) | 103 (36.8) | 0.92 (0.64–1.33) | 0.67 |
| GG | 4 (3.5) | 6 (2.0) | 2.04 (0.55–7.54) | 0.29 | 11 (4.3) | 19 (6.8) | 0.56 (0.25–1.23) | 0.15 |
| Trend | | | 1.38 (0.90–2.10) | 0.14 | | | 0.84 (0.63–1.12) | 0.23 |
| GG | 72 (63.7) | 217 (72.6) | Reference | | 150 (58.8) | 137 (48.9) | Reference | |
| AG | 36 (31.9) | 75 (25.1) | 1.47 (0.90–2.39) | 0.12 | 92 (36.1) | 117 (41.8) | 0.72 (0.50–1.03) | 0.07 |
| AA | 5 (4.4) | 7 (2.3) | 1.99 (0.60–6.56) | 0.26 | 13 (5.1) | 26 (9.3) | 0.42 (0.21–0.87)² | 0.02 |
| Trend | | | 1.45 (0.97–2.16) | 0.07 | | | 0.68 (0.52–0.90)² | 0.007 |
| AA | 108 (95.6) | 269 (90.0) | Reference | | 129 (50.6) | 177 (63.2) | Reference | |
| AG | 5 (4.4) | 29 (9.7) | 0.50 (0.19–1.35) | 0.17 | 112 (43.9) | 91 (32.5) | 1.64 (1.14–2.36) | 0.007 |
| GG | 0 (0.0) | 1 (0.3) | | | 14 (5.5) | 12 (4.3) | 1.54 (0.69–3.47) | 0.30 |
| Trend | | | 0.49 (0.18–1.30) | 0.15 | | | 1.46 (1.09–1.96)² | 0.01 |
| rs1805329 (Ala249Val) | | | | | ³ | | | |
| GG (Ala/Ala) | 59 (52.2) | 157 (52.5) | Reference | | | | | |
| AG (Ala/Val) | 43 (38.1) | 122 (40.8) | 1.04 (0.65–1.67) | 0.87 | | | | |
| AA (Val/Val) | 11 (9.7) | 20 (6.7) | 1.63 (0.72–3.67) | 0.24 | | | | |
| Trend | | | 1.17 (0.83–1.66) | 0.37 | | | | |
| rs2228001 (Lys939Gln) | | | | | | | | |
| AA (Lys/Lys) | 46 (41.1) | 151 (50.7) | Reference | | 143 (56.1) | 132 (48.2) | Reference | |
| AC (Lys/Gln) | 55 (49.1) | 121 (40.6) | 1.40 (0.88–2.24) | 0.16 | 90 (35.3) | 116 (42.3) | 0.72 (0.50–1.03) | 0.07 |
| CC (Gln/Gln) | 11 (9.8) | 26 (8.7) | 1.38 (0.62–3.06) | 0.43 | 22 (8.6) | 26 (9.5) | 0.76 (0.41–1.41) | 0.38 |
| Trend | | | 1.26 (0.89–1.77) | 0.19 | | | 0.81 (0.62–1.05) | 0.11 |

  ¹ ORs adjusted for age, sex and genetic ancestry using unconditional logistic regression.
  ² ORs adjusted for smoking (pack years) in addition to age, sex and genetic ancestry using unconditional logistic regression: Latinos: ERCC2 rs13181 CC (Gln/Gln), OR = 2.61, 95% CI: 1.07–6.35, p = 0.04; PPP1R13L rs6966 number of A allele, OR = 0.63, 95% CI: 0.42–0.95, p = 0.03. African Americans: ERCC5 rs17655 CC (His/His), OR = 2.13, 95% CI: 1.25–3.62, p = 0.005; rs17655 number of C allele, OR = 1.45, 95% CI: 1.11–1.89, p = 0.006; LIG1 rs20579 AA, OR = 0.39, 95% CI: 0.17–0.86, p = 0.02; rs20579 number of A allele, OR = 0.69, 95% CI: 0.51–0.93, p = 0.01; rs439132 number of G allele, OR = 1.60, 95% CI: 1.17–2.20, p = 0.003.
  ³ Analysis was not performed since minor allele frequency was less than 0.05.

Haplotype analysis

For ERCC2, Latinos had 3 haplotype blocks, whereas African Americans had 2 haplotype blocks (Supplemental Figs. 1 and 2). For LIG1, Latinos had 1 haplotype block of 3 SNPs and African Americans had 1 haplotype block of 5 SNPs (Supplemental Figs. 3 and 4).

Among Latinos, reduced lung cancer risk was associated with ERCC2 haplotypes 2B and 3B compared with the most frequent haplotypes (2A and 3A, respectively) (Table III). For African Americans, the most significant result was observed with LIG1 haplotype 1B, which was inversely associated with risk of lung cancer compared with haplotype 1A (Table III). The reduced risk with this haplotype appears to be attributable to the combination of the G allele of rs3730931 and the A allele of rs20579. We performed sensitivity analyses with these significant haplotypes adjusting for smoking as an additional covariate, and the results were either similar or more statistically significant (see footnote of Table III).

Table III. Haplotype Analysis of ERCC2 and LIG1, Latinos and African Americans, the San Francisco Bay Area Lung Cancer Study 1998–2003
Latinos

| Haplotype | Cases (%) | Controls (%) | OR¹ (95% CI) | p-value |
|---|---|---|---|---|
| ERCC2 | | | | |
| Block 1: rs13181 (Lys751Gln) and rs1052555 (Asp711Asp) | | | | |
| 1A AG | 71.7 | 78.1 | Reference | |
| 1B CA | 23.9 | 18.1 | 1.30 (0.88–1.91) | 0.18 |
| 1C Rare haplotypes² | 4.4 | 3.8 | 1.23 (0.56–2.70) | 0.61 |
| Global haplotype association p-value: 0.37 | | | | |
| Block 2: rs171140 and rs1799793 (Asp312Asn) | | | | |
| 2A CG | 49.6 | 46.2 | Reference | |
| 2B AG | 24.6 | 34.2 | 0.71 (0.49–1.03) | 0.07 |
| 2C AA | 25.8 | 19.6 | 1.18 (0.80–1.74) | 0.40 |
| Global haplotype association p-value: 0.06 | | | | |
| Block 3: rs238406 (Arg156Arg), rs11878644, rs6966 | | | | |
| 3A AAT | 46.7 | 41.4 | Reference | |
| 3B CGA | 20.4 | 30.4 | 0.65 (0.44–0.97)³ | 0.04 |
| 3C CGT | 19.9 | 15.6 | 1.15 (0.74–1.77) | 0.54 |
| 3D CAT | 12.6 | 12.7 | 0.95 (0.59–1.54) | 0.84 |
| 3E Rare haplotypes² | 0.4 | 0.0 | | |
| Global haplotype association p-value: 0.07 | | | | |
| LIG1 | | | | |
| rs20581 (Asp802Asp), rs156641 and rs3730931 | | | | |
| 1A AGA | 44.2 | 44.5 | Reference | |
| 1B GAA | 28.8 | 30.6 | 0.91 (0.63–1.31) | 0.61 |
| 1C GGG | 15.9 | 13.2 | 1.26 (0.80–1.98) | 0.32 |
| 1D GGA | 10.2 | 11.2 | 0.91 (0.53–1.58) | 0.74 |
| 1E Rare haplotypes² | 0.9 | 0.5 | 1.76 (0.27–11.50) | 0.55 |
| Global haplotype association p-value: 0.71 | | | | |

African Americans

| Haplotype | Cases (%) | Controls (%) | OR¹ (95% CI) | p-value |
|---|---|---|---|---|
| ERCC2 | | | | |
| Block 1: rs13181 (Lys751Gln), rs1052555 (Asp711Asp) and rs1799787 | | | | |
| 1A AGG | 75.5 | 77.5 | Reference | |
| 1B CAA | 11.9 | 12.5 | 0.97 (0.66–1.43) | 0.87 |
| 1C CGG | 11.6 | 9.3 | 1.35 (0.89–2.07) | 0.16 |
| 1D Rare haplotypes² | 1.0 | 0.7 | 1.69 (0.44–6.56) | 0.45 |
| Global haplotype association p-value: 0.47 | | | | |
| Block 2: rs238406 (Arg156Arg), rs11878644, rs6966 | | | | |
| 2A CGA | 58.2 | 60.2 | Reference | |
| 2B CGT | 21.6 | 20.3 | 1.08 (0.79–1.48) | 0.62 |
| 2C AAT | 13.7 | 13.7 | 1.03 (0.71–1.50) | 0.87 |
| 2D CAT | 6.3 | 5.8 | 1.22 (0.72–2.08) | 0.46 |
| 2E Rare haplotypes² | 0.2 | 0.0 | | |
| Global haplotype association p-value: 0.75 | | | | |
| LIG1 | | | | |
| rs20581 (Asp802Asp), rs156641, rs3730931, rs20579 and rs439132 | | | | |
| 1A GGAGG | 25.7 | 19.5 | Reference | |
| 1B GGGAA | 19.2 | 23.6 | 0.61 (0.42–0.88)³ | 0.009 |
| 1C GGAGA | 18.2 | 19.6 | 0.71 (0.48–1.06) | 0.09 |
| 1D AGAGA | 16.4 | 16.8 | 0.77 (0.51–1.14) | 0.19 |
| 1E GAAGA | 12.8 | 12.0 | 0.81 (0.52–1.27) | 0.36 |
| 1F Rare haplotypes² | 7.6 | 8.6 | 0.68 (0.41–1.15) | 0.15 |
| Global haplotype association p-value: 0.18 | | | | |

  ¹ ORs adjusted for age, sex and genetic ancestry using unconditional logistic regression.
  ² A group with all haplotypes with frequency of <5%.
  ³ ORs adjusted for smoking (pack years) in addition to age, sex and genetic ancestry using unconditional logistic regression: Latinos: ERCC2 Haplotype 3B OR = 0.64, 95% CI: 0.42–0.99, p = 0.04; African Americans: LIG1 Haplotype 1B OR = 0.54, 95% CI: 0.36–0.82, p = 0.003.

False positive report probability (Table IV)


Table IV. False Positive Report Probability
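| Gene/SNP/Haplotype | OR (95% CI) | Power² | Reported p-value | FPRP at prior probability¹ 0.25 | FPRP at prior probability¹ 0.1 | FPRP at prior probability¹ 0.01 | FPRP at prior probability¹ 0.001 |
|---|---|---|---|---|---|---|---|
| Latinos | | | | | | | |
| ERCC2 Lys751Gln Gln/Gln | 2.53 (1.12–5.70) | | | | | | |
| ERCC2 block 3B | 0.65 (0.44–0.97) | 0.45 | 0.04 | 0.19 | 0.41 | 0.89 | 0.99 |
| African Americans | | | | | | | |
| ERCC5 Asp1104His His/His | 1.78 (1.09–2.91) | | | | | | |
| LIG1 Block 1B | 0.61 (0.42–0.88) | 0.32 | 0.009 | 0.07 | 0.19 | 0.72 | 0.96 |

  ¹ False-positive report probabilities for the observed odds ratios.
  ² Estimation of the statistical power to detect an OR of 1.5 or 0.67 with α level equal to the observed p value.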


Latinos

For the proposed prior probabilities of 0.25 and 0.1, the FPRPs for the Gln/Gln genotype of ERCC2 Lys751Gln were 0.42 and 0.69, respectively, suggesting weak to moderate evidence for the association. The FPRPs for ERCC2 block 3B were 0.19 and 0.41 for prior probabilities of 0.25 and 0.1, respectively, suggesting strong to moderate evidence for the association.

African Americans

For the proposed prior probabilities of 0.25 and 0.1, the FPRPs for His/His genotype of ERCC5 Asp1104His were 0.21 and 0.44, respectively, suggesting moderate evidence for the association. The FPRPs for LIG1 block 1B were 0.07 and 0.19 for prior probabilities of 0.25 and 0.1, respectively, suggesting strong evidence for the association.

Gene-smoking interaction


Latinos

For ERCC2 block 2, haplotype 2B was associated with a statistically significant reduced risk of lung cancer compared with haplotype 2A only among nonsmokers (Table V), but the test for interaction was not statistically significant (p = 0.22). For block 3, haplotype 3B was associated with a reduced risk of lung cancer compared with haplotype 3A only among nonsmokers, and the test for interaction was borderline statistically significant (p = 0.09).

Table V. Haplotype Analysis of ERCC2 by Smoking, Latinos, the San Francisco Bay Area Lung Cancer Study 1998–2003
| Haplotype | Never smoked: OR¹ (95% CI) | Ever smoked: OR¹ (95% CI) |
|---|---|---|
| Block 2: rs171140 and rs1799793 (Asp312Asn) | | |
| 2A CG | Reference | Reference |
| 2B AG | 0.47 (0.24–0.93) | 0.98 (0.60–1.61) |
| 2C AA | 1.14 (0.59–2.22) | 1.48 (0.88–2.50) |
| Interaction p-value = 0.22 | | |
| Block 3: rs238406 (Arg156Arg), rs11878644 and rs6966 | | |
| 3A AAT | Reference | Reference |
| 3B CGA | 0.29 (0.12–0.67) | 0.98 (0.59–1.62) |
| 3C CGT | 0.90 (0.42–1.95) | 1.48 (0.83–2.63) |
| 3D CAT | 1.37 (0.61–3.12) | 0.87 (0.45–1.68) |
| 3E AAA | | |
| Interaction p-value = 0.09 | | |

  ¹ ORs adjusted for age, sex, and genetic ancestry using unconditional logistic regression.

African Americans

For ERCC5 Asp1104His, the risk of those with His/His variant genotype was significantly increased compared with those with Asp/Asp wildtype genotype only among ever smokers (OR = 1.92, 95% confidence interval (CI): 1.10–3.36); however, the test for interaction was not statistically significant (Table VI).

Table VI. ERCC5 rs17655 (Asp1104His) and Haplotype Analysis of LIG1 by Smoking, African Americans, the San Francisco Bay Area Lung Cancer Study 1998–2003
| Genotype/Haplotype | Never smoked: OR¹ (95% CI) | Ever smoked: OR¹ (95% CI) |
|---|---|---|
| rs17655 (Asp1104His) | | |
| GG (Asp/Asp) | Reference | Reference |
| CG (Asp/His) | 1.00 (0.26–3.84) | 1.15 (0.73–1.79) |
| CC (His/His) | 1.21 (0.25–5.92) | 1.92 (1.10–3.36) |
| Interaction p-value = 0.86 | | |
| rs20581 (Asp802Asp), rs156641, rs3730931, rs20579 and rs439132 | | |
| 1A GGAGG | Reference | Reference |
| 1B GGGAA | 0.55 (0.14–2.13) | 0.61 (0.40–0.94) |
| 1C GGAGA | 0.43 (0.10–1.96) | 0.65 (0.41–1.02) |
| 1D AGAGA | 0.50 (0.09–2.76) | 0.66 (0.42–1.04) |
| 1E GAAGA | 1.74 (0.44–6.91) | 0.69 (0.42–1.15) |
| 1F Rare haplotypes² | 3.63 (0.87–15.15) | 0.51 (0.28–0.92) |
| Interaction p-value = 0.05 | | |

  ¹ ORs adjusted for age, sex, and genetic ancestry using unconditional logistic regressions.
  ² A group with all haplotypes with frequency of <5%.

LIG1 haplotype 1B, which was associated with a statistically significant reduced risk of lung cancer in the combined analysis, showed similar ORs across smoking strata (Table VI). Although the test for interaction between LIG1 haplotypes and smoking was borderline statistically significant (p = 0.05), this was mainly attributable to the difference in risk associated with the rare haplotype group between smoking strata, a result that cannot be easily interpreted. We therefore concluded that there is no evidence of interaction between the major LIG1 haplotypes and smoking on the risk of lung cancer.

Multifactor dimensionality reduction analysis


Latinos

For Latino subjects, the MDR procedure identified smoking, rs171140 of ERCC2, rs17655 of ERCC5 and rs20581 of LIG1 as the best combination for predicting case/control status (Table VII and Supplemental Fig. 5), with a prediction accuracy of 67.4% and an associated p-value of 0.001, although smoking alone already had a prediction accuracy of 62.5%. The OR associated with the “high risk” group as defined by the best combination (smoking, rs171140, rs17655 and rs20581) was 8.02 (95% CI: 4.67–13.77, p-value < 0.001), adjusting for age, sex, and percent of European and Amerindian genetic ancestry using unconditional logistic regression.

Table VII. Multifactor Dimensionality Reduction Analysis for Latinos (17 SNPs of NER Pathway) (N = 400)¹, the San Francisco Bay Area Lung Cancer Study 1998–2003

Model                                          Cross validation    Testing balanced    p-value from
                                               consistency²        accuracy (%)²       1,000 permutations
1. Ever smoked                                 10/10               62.51               0.012
2. Ever smoked, rs1799787                      4.2/10              59.27               0.091
3. Ever smoked, rs11878644, rs1799793          2.75/10³            53.75³              0.531
4. Ever smoked, rs171140, rs17655, rs20581     9.9/10              67.44               0.001

¹ 12 subjects were deleted due to missing information for at least 1 SNP.
² Average from 10 runs with different random seed.
³ Average from only 8 out of 10 runs due to instability of the chosen model.
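The core step of MDR behind the models in Table VII can be sketched in a few lines. Each combination of factor levels forms a cell, a cell is labeled high risk when its case:control ratio exceeds the overall ratio, and the resulting two-group classifier is scored by balanced accuracy. The sketch below is a simplified illustration on hypothetical toy data with made-up factor names (`smoke`, `snp_a`, `snp_b`). It omits the cross-validation and permutation testing that produce the cross validation consistency, testing accuracy and p-values reported in the table.

```python
from collections import Counter
from itertools import combinations


def mdr_balanced_accuracy(rows, labels, factors):
    """Core MDR step for one factor combination, without cross-validation."""
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    threshold = n_case / n_ctrl  # overall case:control ratio
    cases, ctrls = Counter(), Counter()
    for row, label in zip(rows, labels):
        cell = tuple(row[f] for f in factors)
        if label == 1:
            cases[cell] += 1
        else:
            ctrls[cell] += 1
    # A multifactor cell is "high risk" when its case:control ratio exceeds the threshold.
    high_risk = {c for c in set(cases) | set(ctrls) if cases[c] > threshold * ctrls[c]}
    sensitivity = sum(n for c, n in cases.items() if c in high_risk) / n_case
    specificity = sum(n for c, n in ctrls.items() if c not in high_risk) / n_ctrl
    return 0.5 * (sensitivity + specificity)


def best_model(rows, labels, names, order):
    """Pick the factor combination of a given order with the highest balanced accuracy."""
    return max(combinations(names, order),
               key=lambda combo: mdr_balanced_accuracy(rows, labels, combo))


# Hypothetical toy data: case status depends jointly on two made-up SNPs, not on smoking.
rows = [{"smoke": s, "snp_a": a, "snp_b": b}
        for s in (0, 1) for a in (0, 1) for b in (0, 1)]
labels = [row["snp_a"] ^ row["snp_b"] for row in rows]
print(best_model(rows, labels, ["smoke", "snp_a", "snp_b"], 2))
```

In the toy data neither SNP predicts case status on its own, yet the pair classifies perfectly, which is the kind of joint effect MDR is designed to detect. Real analyses evaluate accuracy on held-out folds to avoid the overfitting that this training-only score invites.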

African Americans

For African American subjects, smoking was the best predictor of lung cancer with a prediction accuracy of 63.5% and a p-value of <0.001 (Supplemental Table II).

Principal components analysis

Results for the PCA (Supplemental Tables III–VI) were consistent with those from haplotype analyses.


Latinos

The PCA identified a significant inverse interaction between principal component 2 (PC2) and smoking (Supplemental Table III), meaning that the increased risk associated with PC2 weakened as the number of pack-years smoked increased. Three of the 4 SNPs that demonstrated strong correlations with PC2 (rs238406, rs11878644 and rs6966) also made up block 3 of ERCC2 in the haplotype analysis. The direction of the correlation between these 3 SNPs and PC2 indicated that the C allele of rs238406, the G allele of rs11878644 and the A allele of rs6966 constituted a group with lower risk; therefore, the result of the PCA was consistent with that of the haplotype analysis (Table III).

African Americans

The PCA indicated a significantly decreased risk of lung cancer associated with PC1 of LIG1, which is strongly correlated with rs3730931 and rs20579 (Supplemental Table VI). The positive correlation of rs3730931 and rs20579 with PC1 indicated that the alleles associated with reduced lung cancer risk are G for rs3730931 and A for rs20579, which was consistent with results from the haplotype analysis (Table III).
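The way a principal component points to a group of correlated SNPs can be illustrated with a small pure-Python sketch. It assumes genotypes coded as 0, 1 or 2 copies of an allele, extracts the first principal component of the covariance matrix by power iteration, and returns the loadings (which SNPs drive the component) and per-subject scores. The toy genotype matrix is hypothetical, and the authors' software, coding and scaling choices are not described in this section.

```python
import math


def first_principal_component(data, iterations=200):
    """Loadings and scores of PC1 via power iteration on the covariance matrix."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] * p  # starting vector; must not be orthogonal to PC1
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    return v, scores


# Hypothetical genotypes (rows = subjects, columns = 3 made-up SNPs coded 0/1/2).
# The first two SNPs are in complete linkage disequilibrium; the third is unrelated.
genotypes = [[0, 0, 1], [1, 1, 0], [2, 2, 1], [0, 0, 0], [2, 2, 0], [1, 1, 1]]
loadings, scores = first_principal_component(genotypes)
print([round(x, 3) for x in loadings])
```

In this toy example the first component loads equally on the two linked SNPs and not at all on the third, mirroring how a component that correlates with several SNPs of one haplotype block summarizes that block. The per-subject scores are the quantities that would then be entered into a risk model.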


Discussion

In this study, we used an integrative approach to analyze both single variants and haplotypes of genes in the NER pathway, including MDR analysis to account for the complex gene–gene and gene–smoking interactions, and PCA for thorough exploration of correlations among variants that are not linkage-phase dependent. For Latinos, in the MDR analyses, smoking was a strong predictor of lung cancer, as expected, but 3 SNPs (ERCC2 rs171140, ERCC5 rs17655 and LIG1 rs20581) also increased the case-control prediction accuracy, suggesting that additional effect modification by genetic factors may also be important. As MDR deals with statistical prediction, whether the results of MDR have any biological significance would need to be confirmed by laboratory studies.

Another strength of this study was the ability to control for ancestry differences among cases and controls within each ethnic group using ancestry informative markers. As previously described, cases of this study were ascertained from a population registry, while controls came from a variety of sources including random digit dialing, Health Care Financing Administration (Medicare) rolls and community sources such as churches, senior centers, etc.47 This may explain why the percentage of Amerindian ancestry was higher among Latino controls than cases; controls were more likely to have Central American heritage, whereas cases were more likely to be third or higher generation US ancestry and Mexican ancestry. Controlling for this difference in ancestry (population stratification) by including genetic ancestry, as determined by an extensive panel of ancestry informative markers, in the logistic models increases confidence that observed differences between cases and controls for NER pathway genes are not due to ancestral differences. For Latinos, the adjustment for genetic ancestry moved the association toward the null for most SNPs or haplotypes, suggesting the existence of some population stratification, but the confounding of the gene-disease association by population stratification did not appear extensive. For African Americans, the results were almost identical with and without adjusting for genetic ancestry, suggesting that population stratification was minimal. Because population stratification depends on differences in allele frequencies and disease risks among ethnic groups, the minimal impact of population stratification observed in this study cannot be generalized to other studies with different SNPs and different admixed populations.
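The mechanism of confounding by population stratification described above can be made concrete with a small numeric sketch. It swaps in a simpler technique than the one used in this study: instead of adjusting a logistic model for continuous genetic ancestry, it compares a crude odds ratio with a Mantel-Haenszel odds ratio across two discrete ancestry strata. All counts are hypothetical and chosen so that the allele has no effect within either stratum.

```python
def odds_ratio(a, b, c, d):
    """OR from a 2x2 table: a, b = carrier cases/controls; c, d = non-carrier cases/controls."""
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary OR across strata of (a, b, c, d) counts."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical counts. Stratum 1 has both a higher carrier frequency and a higher
# proportion of cases than stratum 2; within each stratum the OR is exactly 1.
strata = [(60, 20, 30, 10), (10, 30, 30, 90)]
crude = odds_ratio(*[sum(column) for column in zip(*strata)])
adjusted = mantel_haenszel_or(strata)
print(round(crude, 2), round(adjusted, 2))
```

Pooling the strata produces a spurious association (crude OR above 2) that disappears once ancestry is accounted for, which is the pattern of movement toward the null that adjustment is meant to reveal when stratification is present.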

Comparisons of our results for each gene in relation to previously reported literature are discussed in detail later.


ERCC2

In this study, Asp312Asn (rs1799793) was not significantly associated with lung cancer risk among either Latinos or African Americans. In contrast, the Gln/Gln genotype of Lys751Gln (rs13181) was associated with increased lung cancer risk among Latinos but not among African Americans. The only other study of ERCC2 and lung cancer among African Americans also reported a null association between the Lys751Gln Gln/Gln genotype and lung cancer (OR = 1.03; 95% CI: 0.40–2.65) and did not report on other ERCC2 variants.8 These variants have been assessed in 20 studies of Asians and Caucasians with mixed results.5, 6, 8–11, 13, 17–21, 23, 24, 26, 29, 36–38, 41 A recent meta-analysis of ERCC2 variants in 11 populations found that the Asp312Asn polymorphism was not associated with risk of lung cancer, and that the Lys751Gln Gln/Gln genotype yielded a pooled OR of 1.30 (95% CI: 1.13–1.49) with data from 15 study populations.68 This association was confined to Caucasians (OR = 2.25; 95% CI: 0.97–5.23) and was not apparent in Asian populations (OR = 1.02; 95% CI: 0.20–5.27).68 However, the null result could be due to a low frequency of Gln/Gln among Asians (≤2% for 3 of the 4 Asian studies included in the meta-analysis).68 More recent studies also showed no association of lung cancer risk with Asp312Asn polymorphisms in either Asians13, 24, 37 or Caucasians,9 whereas 1 of 5 recent studies showed a significantly increased risk of lung cancer associated with the Lys/Gln genotype of Lys751Gln (the positive study is ref. 36; the 5 studies are refs. 9, 13, 19, 24 and 36). The functional impact of the ERCC2 polymorphisms is yet to be clarified. A recent study showed that the variants of the Arg156Arg, Asp312Asn and Lys751Gln polymorphisms were all associated with decreased mRNA expression.69 However, another study showed that the variants of Asp312Asn and Lys751Gln and the double variant (Asp312Asn/Lys751Gln) had no impact on NER capacity or the basal transcription of ERCC2.70

Ethnic differences in associations of lung cancer risk with ERCC2 variants suggest that either those polymorphisms may only be important for certain ethnicities or the presence or absence of associations could result from different linkage patterns between the SNPs genotyped and the causal SNPs. There is a high variability in the allele frequencies and the linkage disequilibrium patterns of ERCC2 polymorphisms among Europeans, Africans and Asians.50 Thus, it is important to examine the association between ERCC2 haplotypes and the risk of lung cancer, as haplotype analysis may point to the important region(s) of the gene that warrant further examination. Furthermore, the lung cancer risk may not be attributed to individual SNPs but more to haplotypes, which may reflect the joint effect of multiple SNPs.

For Latinos, both the haplotype analysis and the PCA of ERCC2 suggested that Block 2 and Block 3 may be important regions associated with the risk of lung cancer. The strongest association was for Block 3, which spans the 5′ upstream region of ERCC2. Given the association observed in Latinos, further examination and sequencing of the 5′ upstream region of ERCC2 may be warranted, as it may contain important regulatory sequences and polymorphisms influencing the expression of ERCC2.

Among Latinos, interaction analyses showed that the association between lung cancer risk and ERCC2 haplotypes was confined to nonsmokers. Similar findings have been reported by 3 other studies in other ethnic groups.9, 11, 38 A possible explanation is that the extensive damage due to the high dose of carcinogens among heavy smokers overwhelms the DNA repair capacity of ERCC2, and the “protective” advantages of certain genotypes or haplotypes are attenuated or obliterated under such conditions.


ERCC5

Too few studies have examined variants in ERCC5 in relation to lung cancer risk for consistent results to have emerged. Among African Americans in this study, those with the His/His genotype of Asp1104His had a statistically significantly higher lung cancer risk. Although similar results were reported by the only other study among African Americans, those results were not statistically significant because of the small number of study subjects (71 cases and 71 controls).7 Significantly higher lung cancer risk among His1104 carriers has also been observed among Caucasians, Mexican Americans, Asian Americans7 and Koreans.14 In contrast, among Latinos, we observed a lower risk of lung cancer for those with the His/His genotype, although it was not statistically significant. A lower risk of lung squamous cell carcinoma for His carriers was also suggested in a study among Japanese subjects.24 However, a study among Chinese found no association of His1104 genotype or 2 ERCC5 haplotype blocks with lung cancer risk.26 In contrast, a study among Caucasians reported increased lung cancer risk with the rare haplotype (CCCGA) formed by rs732321, rs4150360, rs3759500, rs3818356 and rs4771436.19 As we only typed 1 SNP for ERCC5, we were not able to perform haplotype analyses.

Among African Americans, our analysis suggested a possible interaction between ERCC5 variants and smoking on lung cancer risk, with ever smokers carrying the His/His genotype having the highest risk of lung cancer. Two studies reported a similar interaction between Asp1104His and smoking on the development of lung cancer.7, 14

The functional impact of Asp1104His polymorphism is currently unknown though the resulting amino acid substitution may potentially affect the structural integrity of the protein. Future laboratory assessment is necessary to determine the functional impact of this polymorphism.


LIG1

Among Latinos, none of the 5 LIG1 SNPs included in this study were significantly associated with lung cancer risk, although the number of A alleles of rs20579 showed a borderline significant trend with increasing risk (p = 0.07). For African Americans, the rs20579 A allele was significantly associated with a decreased lung cancer risk, whereas the rs439132 G allele was significantly associated with increased risk. A study among Eastern and Central Europeans showed that subjects who were heterozygous for rs20579 had an increased risk of young-onset lung cancer compared with those with the homozygous wildtype genotype.15 In addition, the same study reported that the variant G allele of rs3730931 was associated with an increased risk of early-onset lung cancer, which was not observed in our study. Neither our study nor the study by Michiels et al.19 found any association between rs20581 (Asp802Asp) or rs156641 variants and lung cancer risk.

Among Latinos, neither our haplotype analysis nor our PCA revealed any association between LIG1 variants and lung cancer risk. For African Americans, our haplotype analysis and PCA suggested that variations in rs3730931 and rs20579, or in regions linked to those 2 SNPs, may be associated with lung cancer risk. Similarly, the only other study of lung cancer risk and LIG1 haplotypes reported a statistically significant association,19 although different choices of SNPs and a study population with a different ethnic background make it difficult to compare the results of their haplotype analysis with ours.


RAD23B

Among Latinos, RAD23B Ala249Val variants were not significantly associated with lung cancer risk. We did not assess the Ala249Val polymorphism among African Americans because the minor allele frequency (MAF) was low (4%). A study among Chinese reported an elevated lung cancer risk associated with having either the Ala/Val or the Val/Val genotype.26 Another study also observed a higher frequency of the Val allele among lung cancer cases compared with controls (0.18 vs. 0.15), although the difference was not statistically significant.19


XPC

Similar to 8 previous studies, we did not observe a statistically significant association between XPC Lys939Gln variants and lung cancer risk.3, 12, 15, 16, 19, 24, 26, 30

A major limitation of this study is the relatively small sample size, which may have limited the statistical power to detect a weak SNP-disease association and increased the probability of spurious significant results. The small sample size may also not provide sufficient power to detect gene-environment interactions; therefore, the results of the gene-smoking analysis should be viewed as exploratory. In addition, SNP coverage is sparse in the genes examined by this study, so the negative findings do not necessarily preclude their importance in the development of lung cancer. Further studies should incorporate greater coverage of variation in NER pathway genes. Nevertheless, this is 1 of the few studies examining the association between NER SNPs and lung cancer among Latinos and African Americans.

In conclusion, among Latinos, this study showed that ERCC2 may be associated with risk of lung cancer especially among nonsmokers, and that smoking together with ERCC2, ERCC5 and LIG1 may have a joint influence on the development of lung cancer. For African Americans, we found that ERCC5 and LIG1 were independently associated with lung cancer risk. Thus, our study and others have suggested that different elements of the pathway may be important in the different ethnic groups resulting either from different linkage patterns, genetic backgrounds and/or exposure histories. These results need to be confirmed by future large-scale studies among Latinos and African Americans.


Dr. Jeffrey S. Chang was supported by the National Cancer Institute. We thank Dr. John Belmont of Baylor College of Medicine for the collection of Mayan DNA samples.