DNA methylation changes measured in pre‐diagnostic peripheral blood samples are associated with smoking and lung cancer risk

DNA methylation changes are associated with cigarette smoking. We used the Illumina Infinium HumanMethylation450 array to determine whether methylation in DNA from pre-diagnostic, peripheral blood samples is associated with lung cancer risk. We used a case-control study nested within the EPIC-Italy cohort and a study within the MCCS cohort as discovery sets (a total of 552 case-control pairs). We validated the top signals in 429 case-control pairs from another 3 studies. We identified six CpGs for which hypomethylation was associated with lung cancer risk: cg05575921 in the AHRR gene (pooled p-value = 4 × 10^-17), cg03636183 in the F2RL3 gene (pooled p-value = 2 × 10^-13), cg21566642 and cg05951221 in 2q37.1 (pooled p-values = 7 × 10^-16 and 1 × 10^-11, respectively), cg06126421 in 6p21.33 (pooled p-value = 2 × 10^-15) and cg23387569 in 12q14.1 (pooled p-value = 5 × 10^-7). For cg05951221 and cg23387569 the strength of association was virtually identical in never and current smokers. For all these CpGs except for cg23387569, the methylation levels were different across smoking categories in controls (p-values for heterogeneity ≤ 1.8 × 10^-7), were lowest for current smokers and increased with time since quitting for former smokers. We observed a gain in discrimination between cases and controls measured by the area under the ROC curve of at least 8% (p-values ≥ 0.003) in former smokers by adding methylation at the 6 CpGs into risk prediction models including smoking status and number of pack-years. Our findings provide convincing evidence that smoking and possibly other factors lead to DNA methylation changes measurable in peripheral blood that may improve prediction of lung cancer risk.

Smoking is the main cause of lung cancer with attributable risks of at least 85% for men and 60% for women [1,2], yet a significant number of cases cannot be attributed either to cigarette smoking or other established risk factors such as air pollution. The major mechanisms considered to explain the effect of cigarette smoking on lung cancer risk include the exposure to carcinogenic compounds, the formation of DNA adducts and the accumulation of permanent somatic mutations in tumor suppressor genes and dominant oncogenes [3]. The balance between metabolic activation and detoxification of carcinogens varies between individuals and this is likely to be at least partly responsible for the variation in susceptibility to lung cancer among smokers.
Alterations of the methylation profile of DNA from peripheral blood associated with cigarette smoking have been recently described [4-11]. The altered DNA methylation levels persist long after smoking cessation for some genomic locations (i.e. CpGs), while for others they return to those of never-smokers [12]. Recently, using a study conducted within the NOWAC cohort as discovery data set and studies within the Australian MCCS cohort, the Swedish NSHDS and the EPIC-Heidelberg cohort as replication sets, we observed that smoking-associated DNA methylation alterations in CpGs in the AHRR and F2RL3 genes are associated with lung cancer risk [13]; the association with alterations in DNA methylation in F2RL3 was also reported in a recent German study [14]. Our study provided initial suggestive biological plausibility and statistical evidence that these epigenetic alterations may partly mediate the effect of smoking on lung cancer risk [13].
In order to identify novel DNA methylation changes associated with lung cancer risk and better understand the mechanisms underlying these associations, we conducted a further analysis of the four epigenome-wide association studies (EWAS) in NOWAC, MCCS, NSHDS, EPIC-Heidelberg and in a new, independent EWAS in the EPIC-Italy cohort that became available recently. In all these five case-control studies nested within prospective cohorts, we investigated associations between methylation of DNA from pre-diagnostic, peripheral blood samples and lung cancer risk accounting for reported smoking habits.

Cancer Epidemiology
Baglietto et al.

What's new? It is well known that smoking can cause lung cancer but the concept that it might do so by changing DNA methylation is only emerging. Here the authors identify six sites of methylation (CpGs), where methylation levels were associated with lung cancer risk after adjusting for smoking, current or former. Methylation of five of the CpGs was lowest in current smokers and increased in former smokers with time since quitting, supporting the growing evidence that smoking may lead to DNA methylation changes measurable in peripheral blood and useful as predictive markers for lung cancer risk, especially in former smokers.

Discovery and replication sets
To test our hypotheses regarding the relationship between DNA methylation and lung cancer risk, we used data from a new EWAS within the Italian component of the European Prospective Investigation into Cancer [15] (EPIC-Italy) [16] and a previous one within the Melbourne Collaborative Cohort Study (MCCS) [17] as discovery sets, and data from previous EWAS from the Norwegian Women and Cancer study (NOWAC) [13], the Northern Sweden Health and Disease Study (NSHDS) [18] and EPIC-Heidelberg [19] as replication sets. All five studies were case-control studies nested within prospective cohorts, including 367, 185, 132, 234 and 63 case-control pairs, respectively, for which methylation was measured using the Illumina Infinium HumanMethylation450 BeadChip on DNA extracted from pre-diagnostic, peripheral blood samples. Relative to our previous report [13], here we present the new study within EPIC-Italy as well as the complete data from the MCCS that in the previous report was used only to replicate the signals in the AHRR and F2RL3 genes. For all studies one control was individually matched to each case. In MCCS and NSHDS controls were matched to cases by reported smoking status at blood draw using five categories (never smokers; short-term former smokers: quitting smoking <10 years before; long-term former smokers: quitting smoking 10 years or more before; current light smokers: <15 cigarettes per day; and current heavy smokers: 15 cigarettes or more per day); in EPIC-Heidelberg they were matched by reported smoking status in two categories (current and former) and number of pack-years. In EPIC-Italy and NOWAC controls were not matched by smoking. Further details for each study are provided in the Supplementary Materials.

Laboratory methods, data pre-processing and quality control
Laboratory methods for DNA extraction, quality control, bisulphite conversion and Illumina Infinium HumanMethylation450 BeadChip assays, as well as details about data pre-processing and quality control, are described in detail in the Supplementary Materials and were broadly similar across studies. Exceptions are noted explicitly. For the MCCS some DNA samples were extracted from dried blood spots on Guthrie cards using a method developed in-house [20]. Cases with DNA available only from dried blood spots were matched to controls with the same type of DNA available.
Normalisation procedures of the methylation measures were applied to perform colour channel and probe type correction as described in the Supplementary Materials.

Statistical analysis
For all analyses we used M-values of methylation calculated as log2(beta/(1 - beta)) [21]. To quantify the association between the methylation level at each CpG and the risk of lung cancer we fitted conditional logistic regression models separately for the two discovery sets, the MCCS and EPIC-Italy. For EPIC-Italy, for which smoking was not a matching variable, we adjusted the regression models for smoking. In the regression models, we included as a predictor the pseudo-continuous M-value of methylation at each CpG, which we obtained by dividing the M-values into quartiles according to the distribution in the control group and assigning to each category the within-quartile median value. We estimated odds ratios (ORs) per 1 standard deviation (SD) of the pseudo-continuous variable and the corresponding 95% confidence intervals (95% CI).
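The exposure coding and model described above can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the code used for the analyses: the function names are hypothetical, the within-quartile medians are assumed to be computed among controls, and the model handles only 1:1 matched pairs with a single covariate (no adjustment for smoking). For 1:1 matching the conditional likelihood depends only on the within-pair difference in the covariate, which is what the fitting routine exploits.

```python
import numpy as np


def m_value(beta):
    """Convert beta-values (0 < beta < 1) to M-values: log2(beta / (1 - beta))."""
    beta = np.asarray(beta, dtype=float)
    return np.log2(beta / (1.0 - beta))


def pseudo_continuous(m, m_controls):
    """Map each M-value to the median of the control-based quartile it falls in.

    Assumption: quartile cut-points and within-quartile medians are both
    computed from the control group.
    """
    m = np.asarray(m, dtype=float)
    m_controls = np.asarray(m_controls, dtype=float)
    cuts = np.quantile(m_controls, [0.25, 0.50, 0.75])
    ctrl_bin = np.searchsorted(cuts, m_controls, side="right")  # quartile index 0..3
    medians = np.array([np.median(m_controls[ctrl_bin == k]) for k in range(4)])
    return medians[np.searchsorted(cuts, m, side="right")]


def clogit_pairs(x_case, x_control, n_iter=25):
    """Conditional logistic regression for 1:1 matched pairs, one covariate.

    Maximises the conditional likelihood by Newton-Raphson and returns
    (log odds ratio per unit of x, standard error).
    """
    d = np.asarray(x_case, dtype=float) - np.asarray(x_control, dtype=float)
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-b * d))      # probability of the observed pair ordering
        score = np.sum(d * (1.0 - p))         # first derivative of the log-likelihood
        info = np.sum(d * d * p * (1.0 - p))  # minus the second derivative
        b += score / info
    p = 1.0 / (1.0 + np.exp(-b * d))
    se = 1.0 / np.sqrt(np.sum(d * d * p * (1.0 - p)))
    return b, se
```

Under these assumptions, an OR per 1 SD would be obtained as exp(b × SD), where SD is the standard deviation of the pseudo-continuous variable.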
We ranked the CpGs according to the p-values of the corresponding ORs, separately for MCCS and EPIC-Italy, and identified 34 CpGs with a p-value lower than 10^-4 in at least one study. For these CpGs we calculated pooled MCCS and EPIC-Italy estimates and selected for further analyses and replication 6 CpGs whose combined estimate had a p-value lower than 10^-5. For the selected 6 CpGs and for the previously identified cg03636183 in the F2RL3 gene, we estimated the ORs for lung cancer separately for MCCS, EPIC-Italy and the three replication studies. For this set of seven CpGs, we estimated pooled ORs by combining the study-specific estimates using fixed-effect models, overall and for different categories of smoking (never, former and current). We adjusted the estimates for current and former smokers for number of cigarettes and duration of smoking, and the estimates for former smokers also for time since quitting smoking. We estimated ORs for different categories of time to diagnosis and used a likelihood ratio test to test for heterogeneity. We assessed the possible effect of cell composition on the results by adding into the models the proportions of different cell types (CD8+ and CD4+ T cells, natural killer cells, B-cells, monocytes, granulocytes) calculated using the method suggested by Houseman [22].
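The fixed-effect pooling of study-specific estimates can be sketched as follows. This is a generic inverse-variance sketch, assuming that each study reports an OR with a 95% CI from which the standard error of the log OR is recovered; the function name and the input values in the usage note are hypothetical and are not taken from the tables of this paper.

```python
import math

Z95 = 1.959963984540054  # 97.5th percentile of the standard normal distribution


def pool_fixed_effect(odds_ratios, ci_lowers, ci_uppers):
    """Inverse-variance fixed-effect pooling of study-specific odds ratios.

    Returns (pooled OR, lower 95% limit, upper 95% limit, two-sided p-value).
    """
    num = 0.0
    den = 0.0
    for or_, lo, hi in zip(odds_ratios, ci_lowers, ci_uppers):
        log_or = math.log(or_)
        se = (math.log(hi) - math.log(lo)) / (2.0 * Z95)  # SE recovered from the CI width
        weight = 1.0 / se ** 2
        num += weight * log_or
        den += weight
    pooled = num / den
    se_pooled = math.sqrt(1.0 / den)
    z = pooled / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return (
        math.exp(pooled),
        math.exp(pooled - Z95 * se_pooled),
        math.exp(pooled + Z95 * se_pooled),
        p_value,
    )
```

For example, `pool_fixed_effect([0.5, 0.5], [0.25, 0.25], [1.0, 1.0])` pools two made-up studies with identical estimates: the pooled OR stays at 0.5 while the confidence interval narrows, because the pooled variance is half that of each study.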
We also estimated the association between M-values of methylation and reported smoking for the control group by fitting a linear mixed effect model with slide (i.e. chip) nested within plate fitted as random effects and gender, age at blood collection and smoking fitted as fixed effects. We used likelihood ratio tests to test the association between methylation and smoking.
We evaluated separately for the MCCS and EPIC-Italy the additional contribution of DNA methylation at the CpGs associated with lung cancer risk to the ability of the model to discriminate between cases and controls using area under the curve (AUC) statistics obtained from unconditional logistic regression models adjusted for the matching variables. We accounted for the contribution of smoking by including among the covariates smoking status and the number of pack-years of cigarettes smoked.
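The discrimination measure used here can be illustrated with a small sketch. It assumes that predicted risks have already been obtained from two fitted logistic models (one with the smoking covariates only, one with methylation added); the function names and the example scores are hypothetical. The AUC is computed as the proportion of case-control pairs in which the case has the higher predicted risk, with ties counted as one half.

```python
import numpy as np


def auc(scores, is_case):
    """Area under the ROC curve from predicted risks and case status."""
    scores = np.asarray(scores, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    # All pairwise differences between case scores and control scores
    diff = scores[is_case][:, None] - scores[~is_case][None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


def auc_gain(base_scores, extended_scores, is_case):
    """Gain in AUC from the extended model relative to the base model."""
    return auc(extended_scores, is_case) - auc(base_scores, is_case)
```

With the made-up scores `[0.1, 0.4, 0.35, 0.8]` for two controls followed by two cases, three of the four case-control pairs are correctly ordered, giving an AUC of 0.75. The sketch does not cover how the significance of a gain is tested.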
Finally, for the CpGs whose pooled OR had a p-value lower than 10^-7, we investigated graphically the association between DNA methylation and lung cancer risk in the 100-kilobase region around each CpG site, by plotting the pooled MCCS and EPIC-Italy ORs against the CpG location.

Genome-wide association analysis in the two discovery sets
Relative to the total number of CpGs investigated across the genome, the proportion of CpGs with methylation levels inversely associated with lung cancer risk was 55% in the new EPIC-Italy study and 53% in MCCS. Overall, for 48% of the CpGs we observed concordant associations between methylation level and lung cancer risk in MCCS and EPIC-Italy (either both negative or both positive) and of these 58% were concordant negative.
We identified 34 CpGs for which the smoking-adjusted association with lung cancer risk had a p-value lower than 10^-4 in at least one of the two studies (Supporting Information Fig. 1); these are presented in Table 1 with their ORs and p-values. Of these CpGs, 22 were from MCCS, 9 from EPIC-Italy and 3 were common to both studies: cg21566642 on chromosome 2, cg05575921 in the AHRR gene on chromosome 5 and cg06126421 on chromosome 6. Table 1 also presents the estimates for cg03636183 in the F2RL3 gene on chromosome 19, which, together with cg05575921 in the AHRR gene, we previously reported to be associated with lung cancer risk [13].
Of the associations corresponding to the 34 CpGs listed in Table 1, 73% (95% CI, 54% to 86%; p = 0.01) were concordant in the two studies and, of these, 79% (95% CI, 57% to 92%; p = 0.008) were concordant negative (Supporting Information Fig. 1, right panel).
For six CpGs (cg05951221, cg21566642, cg05575921, cg06126421, cg23387569 and cg12312863) the pooled ORs for lung cancer across EPIC-Italy and MCCS had a p-value lower than 10^-5 (Table 1). For these six CpGs and for cg03636183 in the F2RL3 gene, we tested the association with lung cancer risk in another three independent studies within NOWAC, EPIC-Heidelberg and NSHDS (Table 2 and Fig. 1). For these six CpGs, the results did not materially change when the analyses were adjusted for estimated cell composition (Supporting Information Tables 1 and 2); we did not observe heterogeneity in the ORs for lung cancer between studies (all p-values for heterogeneity ≥ 0.1, Fig. 1) or by time between blood draw and diagnosis, overall or by smoking status (Supporting Information Table 3).

Associations with reported smoking history and associations with lung cancer risk by smoking category
Of the six CpGs associated with lung cancer risk, the methylation levels of five (cg05951221, cg21566642, cg05575921, cg06126421 and cg03636183) were strongly associated with reported smoking history in the control groups (p-values for heterogeneity across smoking categories all ≤ 1.8 × 10^-7). DNA methylation levels were lowest for current smokers, while average levels for former smokers were intermediate between those for current and never smokers; DNA methylation levels for former smokers increased with increasing time since quitting (Table 3; Supporting Information Figs. 2 and 3).
To investigate whether the association between methylation levels at the 6 CpGs and lung cancer risk could be due to residual confounding by smoking, we conducted stratified analyses by smoking status separately for each of the five studies and overall (Table 2). For all the CpGs the pooled ORs were lower than unity for former and current smokers, and the ORs were consistently lower for former smokers than for current smokers; for all CpGs the OR for never smokers was nominally lower than unity but none of the ORs for never smokers were statistically significant.

Ability of methylation levels to predict lung cancer risk
For EPIC-Italy, the value of the AUC for the model including reported smoking status (categorised as never, former and current smoker) and the number of pack-years was 79%. Adding methylation levels for each CpG individually resulted in gains of 1.8% for cg05575921, 1.3% for cg06126421 and <1% for each of the other selected CpGs (Table 4 and Supporting Information Fig. 4). When methylation levels for all the 6 CpGs were included simultaneously in the model, their additional contribution to lung cancer risk prediction was 2.6% (p = 0.034) overall, 11.1% (p = 0.011) in former smokers and 1.2% (p = 0.52) in current smokers. We obtained similar results from the MCCS, in which the overall gain from including all the CpGs combined was 5.5% (p = 0.002) relative to the model including smoking history (categorised as never smokers; former smokers who stopped <10 years before blood draw; former smokers who stopped 10 or more years before blood draw; current smokers who smoked <15 cigarettes per day; and current smokers who smoked 15 or more cigarettes per day) and number of pack-years; the gain was 7.6% (p = 0.004) in former smokers and 3.3% (p = 0.28) in current smokers.

Analyses of regions around the CpGs associated with lung cancer risk
Both cg05951221 and cg21566642 are located in a CpG island on chromosome 2q37.1 in which a region with differential methylation between cases and controls is clearly visible (Fig. 2a); this region extends for approximately 2 kilobases and includes 8 CpGs for which lower methylation levels are associated with an increased risk of lung cancer. The correlation between the methylation M-values for cg05951221 and cg21566642 (259 bases apart) was 0.80 and for cg21566642 and cg01940273 (273 bases apart) was 0.75. The 100-kilobase region around cg05951221 and cg21566642 contains the genes ALPPL2, ALPP, ALP-1 and ECEL1 and the pseudogene ECEL1P2 whose methylation has been found to increase through development [23]. Alkaline phosphatases (ALPs) dephosphorylate a variety of molecules such as proteins, nucleotides and alkaloids. Serum ALPP and ALPPL2 enzyme levels are increased in heavy smokers and in cancer, particularly in seminoma [24].
We observed no significant correlation between cg05575921, located within the AHRR gene on chromosome 5p15.33, and the nearby CpGs (Fig. 2b). One more gene maps within the 100 kilobases flanking the probe: EXOC3, which codes for a component of the exocyst complex, together with its antisense RNA (EXOC3-AS1).
The probe cg06126421 on chromosome 6p21.33 is flanked by a region extending approximately 200 bases containing another 5 CpGs whose methylation levels correlate with methylation levels of cg06126421 (correlations ranging from 0.44 to 0.67) (Fig. 2c). In the 100 kilobases flanking cg06126421 there are seven genes that code for proteins involved in cell cycle checkpoints in response to DNA damage (MDC1), cellular growth and division (DHX16), protection of cells from Fas- or tumor necrosis factor type alpha-induced apoptosis (IER3), and cytoskeleton regulation and membrane traffic (PPP1R18, TUBB, FLOT1, NRM). In particular, FLOT1 mRNA expression has been shown to be upregulated in non-small cell lung cancer tissue [25]. Also, two long non-coding RNAs (lncRNAs) map in the region: MDC1-AS1 and LINC00243.
A region of 300 bases extends around cg23387569 on chromosome 12q14.1: methylation levels for the six CpG sites in this region correlate quite strongly with the methylation level at cg23387569 (correlations ranging from 0.52 to 0.87) (Fig. 2d). The region is located in a CpG island within the AGAP2 gene, which encodes a protein belonging to the centaurin gamma-like family that mediates anti-apoptotic effects of nerve growth factor by activating nuclear phosphoinositide 3-kinase. The AGAP2 gene is overexpressed in cancer cells and promotes cancer cell invasion. The region surrounding cg23387569 has previously been found to be amplified in lung cancer [26], together with four other genes that map in the region (CDK4, CYP27B1, METTL1 and TSFM). One of them codes for cyclin-dependent kinase 4, a member of the Ser/Thr protein kinase family that is important for cell cycle G1-S transition through the RB1-CCND1-CDKN2A pathway, which is known to be damaged in lung cancer.

Discussion
This new analysis, combining data from the four EWAS that previously allowed us to identify the first two CpGs in AHRR and F2RL3 associated with lung cancer risk with previously unpublished data from a new EWAS in EPIC-Italy, led to the discovery of four additional CpGs and showed that methylation at these CpGs may be useful to improve current risk prediction models based on self-reported smoking history. Our previous report that DNA methylation changes at cg05575921 in the AHRR gene and at cg03636183 in the F2RL3 gene were associated with lung cancer risk [13] included mediation analyses which provided initial suggestive evidence that residual confounding was unlikely to explain the observed associations for cg05575921 and cg03636183, and that hypomethylation at these two sites may mediate the effect of tobacco on lung cancer risk. For cg03636183, an association with lung cancer risk similar to the one we observed was also reported in a study of 4,987 participants in the German ESTHER cohort, of whom 97 developed lung cancer during a median follow-up of around 11 years [14]. In ESTHER only three CpGs in the F2RL3 gene, including cg03636183, were measured, using mass spectrometry (i.e. MALDI-TOF); they were targeted because of the established strong association between cigarette smoking and methylation at this site.
Relative to our previous report [13], in the present analyses we have included data from a new case-control study nested within EPIC-Italy and presented the complete data from the MCCS that was previously utilised only to validate the CpGs within the AHRR and F2RL3 genes; consequently, the results reported here for all other CpGs can be considered original and independent from those previously published. To investigate the possible role of smoking in explaining the observed associations between DNA methylation and lung cancer risk, we deliberately oversampled cases of former and never smokers from some of the cohorts.
The associations we observed between DNA methylation and lung cancer risk are relatively strong (ORs for 1 SD increase in DNA methylation are between 0.74 and 0.50), they are not limited to current smokers and they remained strong after adjusting for smoking duration and intensity: this suggests that the associations between DNA methylation and lung cancer risk are unlikely to be explained by residual confounding by smoking. The observation that for all the identified CpGs except one the methylation levels are lower in current smokers and rise to the levels of never smokers with increasing time since quitting suggests that smoking contributes to the methylation status of these CpGs, although it might not be the only determinant and disentangling the relation between smoking, methylation and lung cancer risk might be challenging [27].
Interestingly, for all six CpGs identified the ORs for lung cancer for former smokers are consistently lower than the ORs for current smokers. This observation, consistent across all five studies, is intriguing but difficult to explain. The analyses by smoking have been adjusted for smoking intensity, duration and time since quitting in former smokers, but we cannot exclude that the result is due to misreported smoking habits or residual confounding. It has been recently reported that inflammation processes such as those caused by smoking induce changes in the methylation profile of natural killer cells, including hypomethylation in the AHRR gene [28], for which we observe associations with smoking and lung cancer risk in our studies. The stronger associations observed in former smokers might reflect the activation of a persistent immune response to smoking that continues and does not resolve years after smoking cessation for selected ex-smokers who develop lung cancer. Although the adjustment for cell composition with the algorithm proposed by Houseman and colleagues does not materially modify any of the observed associations, the algorithm does not include all minor immune cell fractions that might still have a role in confounding the results [22]. Lung diseases, including lung cancer, may trigger an immune response and alter the prevalence of specific cell types in the blood [29]; it is therefore possible that the immune response generated by undiagnosed lung cancer already present for some cases at baseline may lead to differences in the overall methylation profile that could potentially explain our findings. However, the possibility that the observed associations are due to the effect of subclinical lung cancer is not supported by our data, as the observed associations did not change when the analyses were stratified by time between blood draw and lung cancer diagnosis.
The observations that at cg23387569 the association between methylation levels and smoking history was not evident or at least not as strong as for the other CpGs and that for all CpGs the associations between DNA methylation levels and lung cancer were not limited to current smokers suggest that DNA methylation changes at these CpGs may play a role in pathways to lung cancer that are independent of smoking. Further studies specifically designed to increase the number of lung cancer cases in never smokers are necessary to provide convincing evidence to support this hypothesis.
The use of the Illumina Infinium HumanMethylation450 BeadChip allowed us to obtain epigenome-wide data at single CpG resolution for a relatively large number of cases and controls. However, these microarray data do not permit the systematic investigation of the regions surrounding the CpGs identified to evaluate whether the observed associations may be regional and thus have greater predictive power should a more comprehensive measure of methylation be possible. In the region on chromosome 2q37.1, for example, two measured CpGs (cg21566642 and cg05951221) were both strongly associated with lung cancer risk and others in the same region show suggestive evidence of association. It is possible that methylation at other unmeasured sites in this region might be even more strongly associated with lung cancer risk. It is, therefore, important that further studies, for example based on targeted bisulphite sequencing, are conducted to finely map methylation in these regions.
To further investigate the functional relevance of the observed associations it would be important to test whether methylation in the CpGs identified alters the expression of proximal genes. We could not do this directly in our cohorts as we do not have gene expression data, but we have investigated it in other datasets and available public data. In a previous study we showed that methylation at cg05575921 was associated with decreased expression of the AHRR gene both in lung tumour tissue from current smokers and in mouse models of exposure to cigarette smoking [7].
For the three probes that map within a gene sequence (cg05575921 in AHRR; cg23387569 in AGAP2; cg03636183 in F2RL3) we investigated the correlation between methylation and expression using TCGA (http://cancergenome.nih.gov/) and HapMap (http://hapmap.ncbi.nlm.nih.gov/) data [13]. In the latter, only data for the F2RL3-probe were available. In brief, AHRR-probe methylation seems to be inversely correlated with AHRR expression in lung tumour tissue from TCGA. F2RL3-probe methylation does not show methylation-expression correlation in TCGA data but HapMap data suggest a weak inverse correlation (Pearson's correlation coefficient = -0.28, p-value < 0.01). AGAP2-probe methylation seems to be positively correlated with AGAP2 expression in lung tumour tissue from both adenocarcinomas and squamous cell carcinomas (Pearson's correlation coefficient = 0.49, p-value 0.025; and Pearson's correlation coefficient = 0.55, p-value = 0.15, respectively).
In the study within the ESTHER cohort, the authors estimated that the gain in ability to discriminate between cases and controls by adding methylation levels at cg03636183 in the F2RL3 gene to a model that included smoking with pack-years smoked was marginal (<1% increase in the AUC or C-statistic) [14]. The analysis we conducted showed that the inclusion of the methylation level of all 6 CpGs in the prediction model produced an overall gain between 3% and 6% in its discriminatory ability; the gain was as high as 8% to 13% in former smokers. These findings encourage further work to increase the sample size and genome coverage, to identify further regions with altered DNA methylation associated with lung cancer risk, and to use this new information to improve current risk prediction models for lung cancer, especially in former smokers. The new models should be tested in terms of both their ability to discriminate between cases and controls and the accuracy of the predicted probabilities.