Genomic characterization of individuals presenting extreme phenotypes of high and low risk to develop tobacco‐induced lung cancer

Abstract Single nucleotide polymorphisms (SNPs) may modulate individual susceptibility to carcinogens. We designed a genome‐wide association study to characterize individuals presenting extreme phenotypes of high and low risk to develop tobacco‐induced non‐small cell lung cancer (NSCLC), and we validated our results. We hypothesized that this strategy would enrich the frequencies of the alleles that contribute to the observed traits. We genotyped 2.37 million SNPs in 95 extreme phenotype individuals, that is: heavy smokers that either developed NSCLC at an early age (extreme cases); or did not present NSCLC at an advanced age (extreme controls), selected from a discovery set (n = 3631). We validated significant SNPs in 133 additional subjects with extreme phenotypes selected from databases including >39,000 individuals. Two SNPs were validated: rs12660420 (p combined = 5.66 × 10−5; OR combined = 2.80), mapping to a noncoding transcript exon of PDE10A; and rs6835978 (p combined = 1.02 × 10−4; OR combined = 2.57), an intronic variant in ATP10D. We assessed the relevance of both proteins in early‐stage NSCLC. PDE10A and ATP10D mRNA expressions correlated with survival in 821 stage I–II NSCLC patients (p = 0.01 and p < 0.0001). PDE10A protein expression correlated with survival in 149 patients with stage I–II NSCLC (p = 0.002). In conclusion, we validated two variants associated with extreme phenotypes of high and low risk of developing tobacco‐induced NSCLC. Our findings may allow to identify individuals presenting high and low risk to develop tobacco‐induced NSCLC and to characterize molecular mechanisms of carcinogenesis and resistance to develop NSCLC.


Introduction
In the recent years, genetic susceptibility to lung cancer has been explored by many genome-wide association studies (GWAS), which have reported several genomic loci and candidate genes that exert moderate effects on lung cancer risk. The most well-known loci and genes identified are the CHRNA3 and CHRNA5 subunits of the α7 nicotinic acetylcholine receptor (nAChR), located on the 15q25 region [1][2][3]; TERT and CLPTM1L [4][5][6][7], which are closely located at 5p15.33; BAT3, mapped to 6p21.23 [8]; and GPC5 at 13q31.3 [9]. Most of these studies follow a random sampling design, including smoking habits as a covariate. Several noteworthy associations are found exclusively in subgroups of subjects defined upon their smoking habits. For example, the 15q25 locus is associated with lung cancer in former and current smokers, while the 13q31 locus is associated with susceptibility to lung cancer in never smokers [7,9].
However, despite tobacco smoke being the most relevant known risk factor for lung cancer, no association studies have analyzed the individual risk that it confers through the study of subjects presenting extreme phenotypes of high and low risk to develop non-small-cell lung cancer (NSCLC) induced by tobacco. Such designs may help to identify highrisk populations that could benefit from prevention and screening programs and to identify the molecular mechanisms that underlie such clinically relevant phenotypes.
We conducted a GWAS applying an extreme phenotype sampling in heavy smokers presenting high and low risk of developing NSCLC, and we validated our results using the same approach. We selected individuals who developed tobacco-induced NSCLC at early onset and healthy subjects who did not develop NSCLC at an advanced age, despite having a long smoking history. We aimed to identify new susceptibility variants related to the respective extreme phenotypes.

Study design
We performed a two-stage extreme phenotype study to increase the efficiency of discovering single nucleotide polymorphisms (SNPs) associated with the risk of developing tobacco-induced NSCLC. We hypothesized that risk Figure 1. Study design. From our series, we selected the individuals presenting extreme phenotypes of high and low risk of developing NSCLC induced by tobacco. Heavy smokers that either developed NSCLC at an early age were selected as extreme cases, and individuals that did not develop NSCLC at an advanced age, despite heavy tobacco consumption, were selected as extreme controls. The specific thresholds to define these populations were set to select the most extreme phenotypes in our series. *We included selected cases that developed NSCLC at extremely young ages but presented tobacco consumption <15 pack-years, given their phenotypic relevance and because we assumed that they were too young to achieve the threshold of smoke consumption. J. P. Fusco et al. Extreme Phenotypes of Lung Cancer Risk alleles would be strongly enriched in the phenotypic extremes, and therefore, a limited number of carefully selected individuals with extreme phenotypes might be sufficient to identify novel candidate genes and/or alleles [10].
We enrolled subjects into a discovery and a validation set (Fig. 1). The cancer cohort subjects (extreme cases) were selected from heavy smokers (≥15 packs-years) with histologically confirmed diagnosis of NSCLC at an early age (≤55 years). In the validation series, we included selected cases that developed NSCLC at extremely young ages but presented tobacco consumption <15 packs-years, given their phenotypic relevance and because we assumed that they were too young to achieve the threshold of smoke consumption. We also included some borderline cases for age (≤56 years). The cancer-free cohort individuals (extreme controls) were selected from heavy smokers (≥15 packs-years) that had not developed NSCLC or any other type of cancer at an advanced age (≥72 years). The thresholds for tobacco consumption and age were set with the aim to select from our series the individuals presenting the most extreme phenotypes regarding individual risk of developing NSCLC induced by tobacco.
The discovery set was recruited among 3631 patients included in the databases of the University Clinic of Navarra (Pamplona, Spain), Center for Applied Medical Research (CIMA, Pamplona, Spain), and Hospital Universitario Nuestra Señora de La Candelaria (Tenerife, Spain).
The independent validation set was recruited from the Spanish branch of the European Prospective Investigation in Cancer (EPIC) Project (www.epic-spain.com), which includes genomic DNA samples and clinical data from 39,880 individuals, and from additional new cases from the University Clinic of Navarra.
Samples and data from patients were processed following standard operating procedures approved by the respective Ethical and Scientific Committees. All patients gave written informed consent to allow the use of their biological samples and clinical data for research purposes. The study protocol was approved by the Ethics Committee of the University Clinic of Navarra.

DNA genotyping
Genomic DNA was obtained from peripheral blood mononuclear cells using the QIAamp DNA Mini Kit (Qiagen Iberia, Madrid, Spain) according to the manufacturer's instructions and stored at −20°C until use. Genotyping in the discovery set was performed using the Illumina HumanOmni2.5-Quad BeadChip according to the manufacturer's protocols (Illumina, San Diego, CA, USA). Genotyping in the validation set was performed using the Infinium assay following the manufacturer's instructions (Illumina).

Statistical analysis
Discovery and validation analyses, per-allele odds ratios (OR), and standard errors were estimated for each SNP using a multivariate logistic regression model, adjusting for sex. The covariates age and tobacco consumption were used for the design, and therefore, they were not included in the statistical analysis.
Genotype clustering and calling were carried out using GenTrain v2.0 in GenomeStudio v2011.1 (Illumina). Samples were excluded from further analyses if they had more than 2% missing genotype data SNPs with a call rate<0.95 or deviated from the Hardy-Weinberg equilibrium (HWE) (p-value ≤1 × 10 −5 ) were also excluded. From the discovery analysis, SNPs with a p-value ≤5 × 10 −4 and minor allele frequency (MAF) > 5% in the single SNP analysis were selected for validation. When two SNPs presented a very high correlation (r 2 > 0.8), they were considered redundant, and only the one that presented the highest Illumina score was selected.

Genomic imputation
To find map causing variants, data imputation for the untyped SNPs at all the validated SNPs was performed using IMPUTE v2.3.2 [11] for samples from the discovery set using phased haplotypes from the 1000 Genomes Phase 3 (b37) panel as reference [12]. Those imputed genotypes with scores >0.3 were tested for association.

Assessment of the prognostic value of mRNA expression and protein expression of target genes
We attempted to validate the functional relevance of the genes associated with the validated SNPs in NSCLC, assessing the expression of the related mRNA and proteins in patients with early-stage NSCLC. We correlated the transcriptomic data of ATP10D and PDE10A with recurrence-free survival (RFS) and overall survival (OS) using the KM-Plotter application [13] in patients with early-stage NSCLC. The database used (http://kmplot.com) includes gene expression data and survival information on 821 patients with stage I-II NSCLC, downloaded from the Cancer Biomedical Informatics Grid (CaBIG), the Gene Expression Omnibus (GEO), and the Cancer Genome Atlas (TCGA). Survival curves were estimated using Kaplan-Meier curves. Significant differences between groups were tested using the log-rank test. RFS and OS were calculated respectively from the date of surgery to the date of recurrence or death.
We assessed the prognostic value of PDE10A protein expression in tumor samples from 149 patients with stage I-II NSCLC resected at University Clinic of Navarra. PDE10A immunohistochemical assay and staining scores established by semiquantitative analysis were performed as previously described [14]. We used an anti-human PDE10A antibody (Genetex) diluted 1:500. The specificity of PDE10A antibody was demonstrated by Western blot analysis and immunocytochemistry of cell lines expressing different levels of the protein. Isotype and negative (omission of the primary antibody) controls were performed [14].
The univariate and multivariate Cox proportional hazards modeling were used to determine the effects of variables on cancer-specific OS and RFS. Variables with p < 0.1 in the univariate analysis were included in the multivariable analysis. The proportional hazards assumption was examined by testing interactions between the covariables of the final model and time. Statistical analysis was performed using SPSS 15.0 (Chicago, Ill., USA) and Stata12 (College Station, TX: USA).

Identification and validation of SNPs related to individuals presenting extreme phenotypes of developing NSCLC induced by tobacco
We analyzed 228 individuals with extreme phenotypes, 95 in the discovery set and 133 in the validation set ( Table 1). The discovery set included 47 extreme cases that developed NSCLC at an early age (mean: 49 years) and were heavy smokers (mean: 41 packs-years). Extreme controls were 48 individuals that did not develop NSCLC at an advanced age (mean: 76 years) despite high tobacco consumption (mean: 69 packs-years). One control from the discovery set was excluded due to low call rate (call rate <0.98). The validation set included 78 extreme cases (mean age: 49 years; and 38 packs-years) and 55 extreme controls (mean age: 77 years; and 48 packs-years).
We successfully genotyped 2,379,855 variants in the discovery set, and 61,061 were excluded from the analysis due to low genotyping call-rate (0.95), and 370 were deviated from HWE in controls, leaving 2,318,553 for further analysis. Thirty-six SNPs presenting high association in the GWAS (p < 5 × 10 −4 , Table S1) were analyzed in the validation set.
Two SNPs were successfully replicated in the validation set ( Table 2). The strongest validated association was with SNP rs12660420 at the 6q27 locus (combined p = 5.66 × 10 −5 ; combined OR = 2.80, 95% CI: 1.69-4.61), which maps to a non-coding transcript of the PDE10A gene. This SNP tags a linkage disequilibrium (LD) block of 85.8Kb in chromosome 6 (coordinates 166,109,230-166,195,063) (NCBI build 37 assembly), as defined by the furthest upstream and downstream SNPs displaying a detectable correlation (r 2 > 0.20) with rs12660420.
Subsequently, genotypes of all other known variants in the locus with minor allele frequency >5% were imputed to this locus. In total, 275 SNPs were reliably imputed (imputation r 2 score>0.3) and analyzed but none showed a lower p-value than rs12660420. The LD plot showing the p-values of these SNPs is shown in Figure 2. The related gene, PDE10A, encodes a protein that belongs to the cyclic nucleotide phosphodiesterase family. PDE10A expression correlates with lung cancer prognosis, and its pharmacological inhibition suppresses growth of NSCLC cell lines [15], so it seems a plausible candidate gene for lung cancer risk. Although the validated SNP lies in a non-coding transcript of the PDE10A gene, genetic variation in regulatory elements may influence expression.
The second validated SNP was rs6835978 (chr4:47498797) in 4p12 (combined p = 1.02 × 10 −4 ; combined OR = 2.57, 95% IC: 1.6-4.1, Table 2), an intronic variant in the ATP10D gene (ATPase phospholipid transporting 10D) [16]. ATP10D has not previously been related to lung cancer or other tumors. rs6835978 tags to a LD block of 82Kb in chromosome 4 (47,481,971-47,564,368), defined as described above. A total of 228 SNPs were reliably imputed in the region, and rs6835978 remained the most significantly associated SNP (along with rs12510653, in complete LD with it [r 2 = 1]). The rectangle below the plot shows the genes mapping in the region as well as, shaded in light blue, the LD block tagged by the associated SNPs. Recombination rates are based on the 1000 Genomes Project, and genomic coordinates are based on Genome Reference Consortium Human Build 37 (GRCh37/hg19). Plots were generated using LocusZoom software [31].
Extreme Phenotypes of Lung Cancer Risk J. P. Fusco et al.

Assessment of the prognostic value of mRNA expression and protein expression of target genes in patients with early-stage NSCLC
We evaluated the prognostic value of mRNA expression of PDE10A and ATP10D with the in silico tool KM-plotter [13] in 821 patients presenting stage I-II NSCLC. We analyzed PDE10A (probes 211170 and 211171) and ATP10D (probe 213238). We classified patients into two groups according to the median value of expression of each biomarker. Patients with higher PDE10A mRNA expression showed decreased OS using both probes (p = 0.011 and p < 0.0001; Fig. 3, panels A and B). Low mRNA expression of ATP10D was associated with shorter OS (p < 0.0001; Fig. 3, panel C).

Discussion
We identified and validated two novel variants associated with individuals presenting extreme phenotypes of risk to develop NSCLC induced by tobacco. To our knowledge, this is the first study that uses this strategy to identify novel genetic variants related to NSCLC cancer risk. This methodology consists of studying reduced groups of individuals with very characteristic and clinically relevant phenotypes, thus enriching biomarker expression in the individuals studied, and it has allowed outstanding cancer biomarkers to be identified, as reviewed elsewhere [17]. In the discovery phase, we used the HumanOmni2.5-Quad BeadChip array, a high-density chip that allows genotyping of 2.5 million SNPs across the whole genome, targeting genetic variation down to 1% MAF. To our knowledge, this is the first study that assesses cancer risk using this powerful platform.
The strongest validated association was SNP rs12660420, which maps to a non-coding transcript of PDE10A. This gene encodes a protein that belongs to the cyclic nucleotide phosphodiesterase family. Phosphodiesterases (PDE) represent a large superfamily of hydrolases that control the intracellular levels of cyclic nucleotides by hydrolyzing cAMP and cGMP to 5_AMP and 5_GMP and are critical regulators of the myriad physiological and pathophysiological processes under cyclic nucleotide control [18]. PDE4, a well-known target for therapy of chronic obstructive pulmonary disease [19], is also expressed in lung cancer and promotes lung cancer progression [20]. Preclinical data indicate that PDE4 inhibitors have also a role in the treatment for EGFR-mutant NSCLC patients with high pretreatment levels of BIM and mTOR [21]. Li et al. showed that PDE10A mRNA and protein levels are overexpressed in colon tumor cells as compared to normal colonocytes; that silencing PDE10A suppressed the growth of colon tumor cells; and that overexpression of PDE10A promotes growth of colonocytes and adenoma cells [22]. The same group recently reported that Pf-2545920, a highly specific PDE10A inhibitor, inhibits colon tumor cell growth at concentrations that increase cGMP and cAMP levels and activate PKG and PKA; and that Pf-2545920 reduces βcatenin-mediated transcription of survivin, resulting in caspase activation and apoptosis [23]. Shen et al. [15] have reported that PDE10A mutations are predicted to influence PDE10A allosteric regulation in lung adenocarcinoma; that high levels of PDE10A expression correlate with worsened lung adenocarcinoma prognosis; and that Pf-2545920 also suppresses the growth of NSCLC cell lines. Finally, PDE10A somatic mutations have been described in up to 19% of prostate cancers and correlated with increased levels of phosphorylated C-AMP responsive element binding protein (p-CREB) [24]. These data suggest that PDE10A might be a relevant target for cancer therapy and prevention.
The second validated variant, SNP rs6835978, is an intronic variant of ATP10D, which belongs to a subfamily of P-type ATPases implicated in the translocation of phospholipids from the exoplasmic to the cytoplasmic leaflet of the cellular biological membranes. Although ATP10D has not previously been related to NSCLC, genetic variation in this gene has been associated with levels of plasma sphingolipids [25]. Bioactive sphingolipids, such as sphingosine-1-phosphate (S1P), and ceramides, are signaling molecules involved in the activation of pathways that are directly relevant to carcinogenesis [26]. Ceramide plays a major role in the development of chronic pulmonary diseases and has also been involved in lung cancer development induced by cigarette smoke [27]. Alberg et al. reported that higher concentrations of S1P and total ceramide in plasma were associated with an increased future risk of lung cancer [28].
We analyzed the prognostic value of PDE10A and ATP10D mRNA and PDE10A protein levels in patients with stage I-II NSCLC. We confirmed that high and low mRNA expression, respectively, of PDE10A and ATP10D significantly correlated with shorter survival. To our knowledge, this is the first association of ATP10D with cancer prognosis. We confirmed that PDE10A protein levels in stage I-II NSCLC patients correlated with decreased survival. The results that we obtained for mRNA and protein expression of PDE10A are consistent among them and also with previous results reported in NSCLC [15]. These results render PDE10A and ATP10D as potentially useful prognostic and therapeutic targets in NSCLC.
Tobacco-induced NSCLC represents one of the most relevant challenges to public health. Therefore, the identification of genetic factors that confer to individuals either an increased risk or an intrinsic protection to develop the disease would be of critical relevance, allowing the identification of high-risk populations, in which tobacco cessation and screening programs may be most beneficial. Most importantly, it may lead to further understanding of mechanisms of carcinogenesis and natural protection against cancer, which might inspire new therapeutic strategies. Interestingly, even though our study was specifically designed for NSCLC, the individuals in the cancer-free cohort did not develop any other tumors. Therefore, our results may also relate to other neoplasms, especially those related to tobacco.
To our knowledge, this is the first study that defines and targets phenotypes of decreased risk to develop tobaccoinduced NSCLC and that attempts to characterize the genetic causes of such protection. The study of these proficient cancer risk phenotypes (PROCARPs) may allow to identify the underlying protective mechanisms and may provide valuable targets for cancer prevention. The identification of PROCARPs is challenging, because it requires to distinguish individuals presenting a truly decreased cancer risk from others which just present absence of the disease (i.e., they are non-apparent phenotypes) [29]. We hypothesized that the selection of individuals that are overexposed to either extrinsic or intrinsic cancer risk factors and yet do not develop cancer may allow to identify and study PROCARPs [17,30]. On the other hand, phenotypes of increased cancer risk (deficient cancer risk phenotypes, DECARPs) are apparent phenotypes that are readily identifiable (e.g., cancer familiar syndromes, cancer development at young ages, etc.) [29].
The main limitation of our study is that the sample size is limited due to the difficulty to enroll extreme individuals. This may impair to reach p-values of genome-wide significance. Nevertheless, to overcome this limitation, we explored mRNA and protein expression of PDE10A and ATP10D in two different patient populations of early-stage NSCLC, and we found very significant correlations with survival. Functional validation will also be required to confirm the relevance of our findings. Also, despite our efforts to recruit highly selected individuals, phenotypic heterogeneity may persist regarding demographical and clinical characteristics. Nevertheless, this limitation constitutes in fact the greatest opportunity to improve this strategy, through the application of more stringent selection criteria to identify more homogeneous phenotypes with regard to age, tobacco consumption, or other relevant variables. Our strategy may also be pursued using different high-throughput techniques, such as exome or genome sequencing or assessment of the transcriptome, and epigenome. Finally, additional studies will be required to further elucidate the optimal definitions of extreme phenotypes.
In summary, we identified and validated two new genetic variants in PDE10A and ATP10D associated with individuals presenting extreme phenotypes of high and low risk of developing tobacco-induced NSCLC, and we confirmed the prognostic relevance of the associated proteins in early-stage NSCLC. Our findings may have relevant implications to define high-risk cancer populations and to