A candidate CpG SNP approach identifies a breast cancer associated ESR1-SNP
Version of Record online: 11 MAR 2011
Copyright © 2010 UICC
International Journal of Cancer
Volume 129, Issue 7, pages 1689–1698, 1 October 2011
How to Cite
Harlid, S., Ivarsson, M. I.L., Butt, S., Hussain, S., Grzybowska, E., Eyfjörd, J. E., Lenner, P., Försti, A., Hemminki, K., Manjer, J., Dillner, J. and Carlson, J. (2011), A candidate CpG SNP approach identifies a breast cancer associated ESR1-SNP. Int. J. Cancer, 129: 1689–1698. doi: 10.1002/ijc.25786
- Issue online: 26 JUL 2011
- Accepted manuscript online: 23 NOV 2010 02:29PM EST
- Manuscript Accepted: 25 OCT 2010
- Manuscript Received: 1 OCT 2010
- Cancer Control and Prevention using Registries and Biobanks (CCPRB). Grant Number: LSHC-CT-2004-503465
- breast cancer;
- CpG SNPs;
Altered DNA methylation is often seen in malignant cells, potentially contributing to carcinogenesis by suppressing gene expression. We hypothesized that heritable methylation potential might be a risk factor for breast cancer and evaluated possible association with breast cancer for single nucleotide polymorphisms (SNPs) either involving CpG sequences in extended 5′-regulatory regions of candidate genes (ESR1, ESR2, PGR, and SHBG) or CpG and missense coding SNPs in genes involved in methylation (MBD1, MECP2, DNMT1, MGMT, MTHFR, MTR, MTRR, MTHFD1, MTHFD2, BHMT, DCTD, and SLC19A1). Genome-wide searches for genetic risk factors for breast cancer have in general not investigated these SNPs, because of low minor allele frequency or weak haplotype associations. Genotyping was performed using MALDI-TOF mass spectrometry in a screening panel of 538 cases and 1,067 controls. Potential association with breast cancer was identified for 15 SNPs, and one of these SNPs (rs7766585 in ESR1) was found to associate strongly with breast cancer, OR 1.30 (95% CI 1.17–1.45; p-value 2.1 × 10⁻⁶), when tested in a verification panel consisting of 3,211 unique breast cancer cases and 4,223 unique controls from five European biobank cohorts. In conclusion, a candidate gene search strategy focusing on methylation-related SNPs identified a SNP that associated with breast cancer at high significance.
Although heredity has been estimated to cause between 25 and 50% of breast cancer susceptibility,1, 2 only 5–10% of breast cancer cases are explained by the highly penetrant, but uncommon, genetic variants in e.g., the BRCA1, BRCA2, and TP53 genes.3–5 Other effects appear to be due to combinations of common genetic variants with weakly increased risks at multiple loci.3, 6–9 As the breast undergoes drastic developmental changes during puberty, pregnancy, lactation, and involution, we decided to investigate genetic variants susceptible to regulation by methylation of CpG sequences in genes for hormone receptors and transporter proteins.
Genome wide association studies have identified genetic variants associated with several diseases,10 including breast cancer.6–8 Although the mechanisms whereby some of these genes contribute to cancer risk appear to be clear, many identified variants occur in introns or between genes and have as yet an obscure relation to biological function. We noted that these associations frequently occur at CpG sites within a consensus sequence for binding of Methyl-CpG binding proteins11 and hypothesized that strategies focusing on selection of such SNPs for genotyping could contribute to the further elucidation of the genetic epidemiology of cancer.
Regulatory regions of many genes are enriched with CpG sequences which can be enzymatically modified by cytosine methylation.12, 13 Hypermethylation of these regions has been demonstrated to repress transcription of tumor suppressor and steroid receptor genes associated with breast cancer.14–19 This can be accomplished by decreasing affinity for transcription activating factors11, 20 or by the recruitment of methyl-CpG-binding proteins such as Kaiso21, 22 or members of the methyl-CpG binding domain (MBD) protein family, i.e., MeCP2, MBD1, MBD2, and MBD4. These proteins act together with corepressor molecules20 by promoting methylation or acetylation of the histone core around which the DNA strand is bound.11, 23 The methylation status of DNA is controlled by DNA methyl transferases, including DNMT1, DNMT3a, and DNMT3b, and by methyl guanine-DNA methyl transferase (MGMT).24 Methylated or native cytosine may also be deaminated by the enzyme dCMP deaminase (DCTD; EC 3.5.4.12), explaining why a C-T transition is the most common point mutation in the genome. Deamination may thus potentially result in permanent and heritable activation or inactivation of coding sequences.25 The extent of DNA methylation is also affected by the availability of methyl groups, which depends on access to the B-vitamins folate, B6, and B12 and on the enzymes participating in one-carbon metabolism. These are encoded by the 5,10-methylenetetrahydrofolate dehydrogenase (MTHFD1), 5,10-methylenetetrahydrofolate reductase (MTHFR), 5-methyltetrahydrofolate-homocysteine S-methyltransferase (MTR), and methionine synthase reductase (MTRR) genes, and are also affected by Reduced Folate Transporter 1 (SLC19A1) and betaine-homocysteine methyl transferase (BHMT).
In this pilot study, genes encoding the estrogen receptors α and β (ESR1 and ESR2), the progesterone receptor (PGR), and sex hormone binding globulin (SHBG) were selected. Nonsynonymous or potentially regulating variants in genes coding for proteins involved in methylation mechanisms, (i.e., MBD1, MECP2, DNMT1, MGMT, MTHFR, MTR, MTRR, MTHFD1, MTHFD2, BHMT, DCTD, and SLC19A1) were also selected and analyzed for possible association with breast cancer in a large, joint European biobank-based study.
Material and Methods
The study was performed within the European network of excellence in biobanking Cancer Control using Population-based Registries and Biobanks (CCPRB) and included the following studies:
Malmö diet and cancer study (MDCS)
Female residents of Malmö in southern Sweden, born 1923–1950, were recruited using the population registry to the Malmö Diet and Cancer Study (MDCS), with lack of informed consent as the only exclusion criterion. A total of 17,035 women were enrolled between 1991 and 1996.26, 27 By linking the MDCS to the national cancer register up until 31st of December 2004, 544 prospectively occurring cases (diagnosed after enrollment and free from known breast cancer at enrollment) of invasive breast cancer were identified and matched to 1,088 controls according to sex, age, and time of sampling at baseline. Median age at breast cancer diagnosis was 63 years (range 45–81) and 32 cases (63 controls) were ≤50 years of age at time of diagnosis (MDCS1). These samples were used in the screening phase.
In 2008, a linkage to the Regional Tumor Registry ascertained 186 new unique cases diagnosed before 31st of December 2007 (median age 66 years, range 49–84) who were matched with 372 new controls (1 case, 2 controls ≤50 years of age) (MDCS2). At this time, 11 of the controls used in the screening phase had become cases; they are hence included as controls in the screening phase and as cases in the verification phase of the study. All individuals with prevalent cancers were excluded both from MDCS1 and MDCS2 (cervical cancer in situ was not defined as cancer).
Malmö Preventive Project (MPP)
In the Malmö Preventive Project (MPP), 10,902 women were recruited by population registry-based invitations between 1977 and 1992 to screening for cardiovascular risk and alcohol abuse. They were asked to donate a blood sample and fill out a life-style questionnaire.28, 29 Among those donating blood for DNA analyses at re-examination (2002–2006), 215 prospective invasive breast cancer cases (median age 62 years, range 32–79, 25 age ≤ 50 years) were identified by cancer registry linkage up until 31st of December 2007 and 430 controls (50 ≤ 50 years) were matched for age, sex, and time of sampling at baseline. None of these women were participants in MDCS and individuals with prevalent cancers were excluded (cervical cancer in situ was not defined as cancer).
North Sweden Health and Disease Study (NSHDS)
An additional 1,680 cases (median age 56 years, range 27–95) and 2,369 controls matched for age, sex, and time of sampling were from the NSHDS (North Sweden Health and Disease Study) (475 cases and 635 controls ≤50 years of age). This cohort consists of individuals residing in the counties of Västerbotten and Norrbotten in northern Sweden. Two NSHDS subcohorts, the Västerbotten Intervention Program (VIP) and the Mammography Screening in Västerbotten are included in this study. The VIP was initiated in 1985 and all individuals are invited for screening at 40, 50, and 60 years of age and asked to donate a blood sample for research purposes. After 10 years a follow up sample is collected. Currently, there are more than 83,000 blood samples from 70,000 participants in the VIP. Samples taken at mammography screening have been stored since 1995 and screening is done among women in the ages of 50–69 years. This subcohort contains 48,000 samples from 27,500 women.29 Individuals with prevalent cancers were excluded (cervical cancer in situ was not defined as cancer).
Iceland
A total of 866 cases (median age 55 years, range 25–93, 314 ≤ 50 years) and 948 controls (median age 58 years, range 26–98, 256 ≤ 50 years) were included, representing 45–77% of all Icelandic women with breast cancer diagnosed between 1957 and 2007, with samples collected between 1998 and 2006. Five hundred and seventy individuals were ≤50 years. Controls were collected between 2000 and 2004, either from women who participated in the population-based cervical or breast cancer screening program and found free of breast cancer or from older women in retirement homes who had not been diagnosed with breast cancer, to generally reflect the ages of the cases. For 410 cases and 453 controls from Iceland, DNA became available late in the study and these samples were analyzed on a diminished version of the original SEQUENOM array.
Poland
Three hundred ninety-one cases (median age 46 years, range 22–81, 315 ≤ 50 years) with early onset or familial breast cancer were recruited at the genetic counselling clinic in Silesia between 1997 and 2006 (cases ≤50 years were collected between 2002 and 2006). Samples from 306 unmatched controls (median age 43 years, range 18–71, 233 ≤ 50 years) were collected between 2003 and 2009 from healthy women attending the same clinic, but who had no or sparse family history of breast cancer.
Figure 1 shows an overview of all populations and their recruitment into our study.
Each sample was donated to its respective biobank with informed consent for biobank-related research, and this study was approved by an ethical institutional review board in each participating country. Within all cohorts, women of age >50 are regarded as postmenopausal.
In total, the verification phase comprised 7,763 samples (3,338 cases and 4,425 controls).
All SNPs creating or destroying a CpG site in the BRCA1, BRCA2, ESR1, ESR2, PGR, and SHBG candidate genes in the dbSNP database (http://www.ncbi.nlm.nih.gov/, October 2006) within 5′ regulatory regions were selected regardless of minor allele frequency, and those with heterozygosity ≥ 5% (CEU of northern and western European population) were selected in other regions. In the case of uncertain or multiple transcriptional start sites this region was extended ≥ 10 kb in the 5′ direction. The same selection criteria were used for genes relevant for the methylation process (MBD1, MECP2, DNMT1, MGMT, MTHFR, MTR, MTRR, MTHFD1, MTHFD2, BHMT, DCTD, and SLC19A1), with the addition of nonsynonymous coding SNPs. Presence in the HapMap database was not a prerequisite for inclusion in the study, but all SNPs were examined for documented linkage in the HapMap database (http://www.hapmap.org/, October 2006) and, in case of documented linkage (D′ > 0.95), one SNP per haplotype block was selected. For an overview of the SNP selection process see Figure 2.
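The central selection criterion, whether a SNP creates or destroys a CpG site, can be illustrated with a minimal Python sketch. This is not the authors' selection pipeline; the function name `cpg_effect` and the representation of a SNP as one flanking base on each side plus two alleles on the forward strand are assumptions made for illustration only.

```python
def cpg_effect(left_base, ref, alt, right_base):
    """Classify a SNP by its effect on CpG dinucleotides (hypothetical helper).

    left_base / right_base: bases flanking the SNP on the forward strand.
    ref / alt: the two alleles at the SNP position.
    Returns "creates", "destroys", or "neutral".
    """
    def has_cpg(allele):
        # A CpG exists if the allele is the C of "CG" or the G of "CG".
        trinucleotide = (left_base + allele + right_base).upper()
        return "CG" in trinucleotide

    ref_cpg, alt_cpg = has_cpg(ref), has_cpg(alt)
    if ref_cpg and not alt_cpg:
        return "destroys"
    if alt_cpg and not ref_cpg:
        return "creates"
    # Either no CpG with both alleles, or a CpG is present with both.
    return "neutral"


# A C-to-T transition in an A[C/T]G context removes the CpG.
effect = cpg_effect("A", "C", "T", "G")
```

A SNP that shifts a CpG by one position (for example C[G/C]G) is reported as neutral here, which is a simplification of this sketch.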
The SEQUENOM MassARRAY®Designer software was used for multiplex SNP analysis design. Forty-seven SNPs failed in the design program due to sequence characteristics and two SNPs were excluded due to multiplex logistics (28-plex). The remaining 230 SNPs were analyzed in the Pilot Study.
Mass spectrometry (MS) SNP analyses
SNP analyses were performed on a MALDI-TOF mass spectrometer (SEQUENOM MassArray) using iPLEX reagents and protocol (SEQUENOM) and 10 ng DNA as PCR template. Primer sets were from Metabion (Martinsried, Germany). Reagents and protocols designed for up to 28-plexes were used in the pilot and up to 36-plexes in screening and verification.
Analysis of the initial MDCS1 samples included in the screening phase was repeated in the verification phase. In addition, about 5% of the samples from Iceland and 8% of the Polish cases were included as blinded duplicates in the verification phase.
Analyses on duplicate samples within the screening phase (3.9% of all analyses) and between the pilot and the screening phase (3.0% of all analyses) were both in 99.9% concordance.
All duplicates within the verification phase were in 100% concordance. Results from 1,605 samples from MDCS analyzed in both screening and verification were in 99% concordance.
All SNPs were tested for consistency with Hardy-Weinberg equilibrium (HWE) within the control sample using a χ² p-value cutpoint of 0.001. Logistic regression models were used to measure the association between SNPs and the risk for breast cancer. For each SNP, ORs and 95% CIs were calculated for each genotype, and per-allele ORs and 95% CIs were calculated using 0, 1, or 2 copies of the minor allele (a) as a continuous variable. ORs were also adjusted for age, and individuals were stratified into two groups (≤50 years and >50 years). All results were stratified by cohort using the Mantel-Haenszel method.
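The HWE check described above can be sketched as a one-degree-of-freedom χ² goodness-of-fit test on control genotype counts. The Python sketch below is illustrative only and is not the software used in the study; the function name `hwe_chi2` and the example counts are hypothetical. It relies on the identity that the upper-tail probability of a χ² variable with one degree of freedom equals erfc(√(χ²/2)), and it assumes both alleles are observed (a monomorphic SNP would cause a division by zero).

```python
import math


def hwe_chi2(n_AA, n_Aa, n_aa):
    """Chi-square test (1 df) for Hardy-Weinberg equilibrium.

    Takes observed genotype counts and returns (chi2, p_value).
    Assumes both alleles are present in the sample.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # frequency of allele A
    q = 1.0 - p                       # frequency of allele a
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical control counts with no heterozygotes at all.
chi2, p_value = hwe_chi2(50, 0, 50)
fails_hwe = p_value < 0.001  # cutpoint stated in the text
```

With the cutpoint of 0.001 given in the text, a SNP whose control genotypes yield a smaller p-value would be flagged as inconsistent with HWE.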
Of the 230 SNPs selected in candidate genes, the 91 most frequent were tested on 47 breast cancer cases and 48 matched controls from the MDCS1 cohort, and the 139 less frequent SNPs were tested on 95 cases and 96 matched controls to verify genetic variation in our population and to exclude multiple SNPs in complete linkage. We were hereby able to exclude 101 SNPs as specified in Figure 2.
Screening included all 129 SNPs that had passed the pilot with addition of 13 SNPs specifically creating or destroying a consensus MeCP2 binding site in the initial candidate genes and 31 SNPs in the genes involved in methylation pathways. These 44 SNPs were preselected from the literature and therefore not included in the pilot. The resulting 173 SNPs were analyzed on the primary MDCS cohort (MDCS1).
From the 173 SNPs screened, 19 SNPs were selected for verification on the basis of at least borderline significance and/or OR point estimates higher than 2.5.
The 19 SNPs that passed screening were successfully analyzed in an independent panel of 7,434 samples (3,211 cases and 4,223 controls) from five different cohorts: (MDCS, MPP, NSHDS, POLAND, and ICELAND). For logistic reasons, only 10 SNPs were tested in 410 cases and 402 controls having late delivery from Iceland. Control genotypes for 4 of the 19 SNPs failed HWE and were therefore excluded from further analysis. Fifteen remaining SNPs were evaluated for association with breast cancer (Table 1).
Of these fifteen SNPs, rs7766585 in ESR1 demonstrated a highly significant association with breast cancer in the independent verification material, with an unadjusted point estimate OR (95% CI) of 1.30 (1.17–1.45) for heterozygotes (p = 2.1 × 10⁻⁶). The point estimates varied somewhat between the cohorts (Fig. 3), with significant associations for northern Sweden (NSHDS), Iceland, and Poland but nonsignificant results for southern Sweden (MDCS2, MPP). The significance was somewhat increased after applying Mantel-Haenszel statistics to control for cohort (Table 2).
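As an illustration of how stratifying by cohort works, the following Python sketch computes the standard Mantel-Haenszel pooled odds ratio from one 2×2 table per cohort. It is a sketch under stated assumptions, not the analysis code used in the study: the function name `mantel_haenszel_or` and the example counts are hypothetical and are not data from this study, and confidence intervals and p-values are omitted.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio across strata.

    Each stratum is a tuple (a, b, c, d):
      a = exposed cases,    b = exposed controls,
      c = unexposed cases,  d = unexposed controls.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Two hypothetical cohorts, each with a within-stratum OR of 4.
pooled_or = mantel_haenszel_or([(2, 1, 1, 2), (4, 2, 2, 4)])
```

Because each stratum contributes in proportion to its size, a cohort-level difference in allele frequency or case fraction does not distort the pooled estimate the way simple pooling of counts could.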
One other SNP in ESR1 (rs851987) tended to be weakly associated with breast cancer (p = 0.03, unadjusted for multiple comparisons).
Only 31 SNPs (in the dbSNP database) from genes involved in the methylation process matched our inclusion criteria. None was present in ZBTB33 (coding for Kaiso). Two SNPs in DNMT1, one in MGMT, and one in MBD1 were considered sufficiently interesting to be included in the verification phase, but none of them retained significance, although the C allele of rs12957023 in the first intron of MBD1 shows a tendency towards increasing the risk of breast cancer (OR 1.32, 95% CI 0.96–1.80) in homozygous women under age 50, and thus may merit further investigation.
To avoid missing sites of potential down regulation of BRCA1 and 2 genes, CpG sites in their promoter regions were also analysed but the two significant findings in the screening phase were eliminated in the verification phase.
Through hypothesis-based screening and validation of 173 candidate SNPs, we found one genetic marker that was associated with breast cancer. Our strategy for SNP selection was based on the hypothesis that hereditable methylation potential could be associated with breast cancer. The hypothesis is also compatible with the weak familial clustering frequently observed in the postmenopausal majority of breast cancers and our own observation that many of the significant SNPs discovered in GWA studies are noncoding CpG SNPs. It is possible that methylated CpGs embedded in consensus sequences may attract methyl binding proteins such as Kaiso, MeCP2 and MBD1–4, which in turn may block transcription through interaction with core histone proteins.
The main finding in this study is a SNP (rs7766585) situated in intron six of ESR1, belonging to a linkage group consisting of more than 27 SNPs in the same vicinity. Dunning et al.30 performed a large SNP-tagging study of the ESR1 gene to look for possible associations with breast cancer and discovered one SNP in intron four (rs3020314) that weakly enhanced cancer risk (OR 1.05 95%CI 1.02–1.09). They tagged SNPs in linkage with rs7766585 but none of those SNPs passed the initial screening and were not examined further.
In our case, the associated risk is seen only in heterozygous carriers, which is not easily explained. The MAF is somewhat low (0.085 in the HapMap CEU population and 0.098 in our verification panel) but not low enough to account for the deviant result amongst homozygous carriers of the G allele. In the screening phase of the study, rs7766585 was selected based on the high OR point estimate (7.79) for G allele homozygotes, but this disappeared in the verification phase. Also, the association amongst heterozygotes was not found in the screening phase (unadjusted OR 0.91, 95% CI 0.71–1.15) but was clearly increased in the verification phase (unadjusted OR 1.30, 95% CI 1.17–1.45). The Mantel-Haenszel-adjusted per-allele OR, which tests our primary hypothesis, was 1.27 (95% CI 1.14–1.41; p = 9.5 × 10⁻⁶), which clearly passes the Bonferroni multiple-comparison significance threshold when adjusting for 15 different SNPs (p < 0.003). Although it is thus unlikely that the association could be due to chance, even larger studies that could further define the breast cancer risk among both heterozygotes and homozygotes are clearly warranted.
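The Bonferroni comparison can be made explicit with a short Python sketch. The family-wise significance level of 0.05 is an assumption (the text states only the resulting threshold of p < 0.003 for 15 SNPs, which is consistent with 0.05/15 ≈ 0.0033); the function name is hypothetical, and the two p-values are those reported in this article for rs7766585 and rs851987.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# Assumed family-wise alpha of 0.05, corrected for 15 verified SNPs.
threshold = bonferroni_threshold(0.05, 15)

# p-values reported in the text.
passes_rs7766585 = 9.5e-6 < threshold   # per-allele, Mantel-Haenszel adjusted
passes_rs851987 = 0.03 < threshold      # unadjusted for multiple comparisons
```

Under this assumption rs7766585 clears the corrected threshold by more than two orders of magnitude, whereas rs851987 does not.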
One other ESR1 SNP (rs851987) among the 15 deserves mentioning. The T allele tended towards protection in the screening phase (unadjusted per allele OR 0.92 95%CI 0.79–1.08), and the results were similar in the verification phase (unadjusted per allele OR 0.92 95%CI 0.85–0.99). The p-value (0.03) does not pass the Bonferroni-correction threshold when adjusting for multiple comparisons (15 SNPs), but this SNP nonetheless remains interesting because of its location 3.7kb 5′ of the ESR1 transcriptional initiation site and the fact that it is a CpG altering SNP and a potential binding site for the MeCP2 protein. If this SNP actually makes a functional methyl-binding site dysfunctional by substituting the C for the T it could theoretically increase the expression of ESR1 since the site would be less prone to attach to methyl-binding proteins that inhibit transcription. The potential association of this SNP with breast cancer should be verified in a much larger cohort.
Several SNPs in the promoter region and first intron of MTHFR, as well as the missense SNPs 677 C-T (rs1801133) and 1298 A-C (rs1801131), were included in the initial screening panel. These were excluded after screening as they did not even approach significance. However, an association with breast cancer has been shown among 677 T carriers with high folate levels in the MDCS cohort.31, 32
The study design, using samples and cases from three different countries and from five different population registries, involves both strengths and weaknesses. Although the Swedish NSHDS and MDCS cohorts have perfectly matched controls in their prospective population-based content, a slight selection towards enrollment of subjects with higher socioeconomic status than the general population can be seen in the MDCS.27 Furthermore, MDCS participants were recruited at age 45–65 years. The exclusion of prevalent cases removes early breast cancer cases from this population. Although the NSHDS participants were primarily included from age 40 and upwards, mammography screening had identified some cases as young as 27 years. Both cases and controls in Swedish cohorts are perfectly matched for age and duration of follow-up. In Iceland mainly prevalent cases of breast cancer were included, with sample collection occurring long after initiation of case acquisition, which will have resulted in an exclusion of lethal cases, and older women with other causes of death from the earliest recruitment period. A similar bias is present in the MPP cohort despite prospective population-based design, as DNA samples were acquired at a delayed follow-up. It is therefore possible that these two study populations are biased towards breast cancer cases with more favorable outcome. The Polish cases are recruited from families with multiple breast cancer cases, or because of early onset of breast cancer. We would expect the presence of other highly penetrant risk factors to overshadow the mild effect expected within our methylation hypothesis. It is therefore of interest to note that although the point estimates of the ORs for SNPs identified in this study varied somewhat, albeit not significantly, between the cohorts, general tendencies prevailed throughout the cohorts.
Although not significant due to small numbers, a secondary analysis pivoting on median age (ca 60 years) demonstrated higher point estimates in the younger half of each cohort for C/T heterozygotes of rs7766585 (results not shown).
To identify new possible alleles increasing breast cancer risk, several different approaches have been undertaken. With the realization that large-scale association studies were needed to identify the low-risk variants presumably responsible for a large proportion of the familial breast cancer cases, initial focus on candidate-gene association studies proved disappointing. Despite many efforts with large cohorts, only one SNP (CASP8 D302H) has achieved a satisfactory significance level (p = 1.1 × 10⁻⁷) using this approach.6 Several other findings have been reported but failed to be replicated.4 Subsequent studies with the GWA approach have both identified and confirmed many new risk loci.4, 5 Even though it has met with criticism,33 most now agree that this approach is needed if we are to discover new susceptibility genes whose functions are crucial but still unknown.
We undertook an approach, limited by cost restrictions, that differs from previous studies where we primarily focused on candidate CpG SNPs (affecting methylation potential) in candidate genes such as hormone receptors and known susceptibility genes as well as several within methylation pathways. Through this approach we were able to identify a new possible risk SNP in the ESR1 gene, whose locus has previously been implicated as a hotspot for breast cancer risk affecting alleles.30, 34 In addition, the specific structure of the investigated cohorts suggests that this variant is more important for younger than for older breast cancer cases.
We believe that our model, specifically designed to identify genetic risk that may interact with lifestyle, could possibly be expanded to generate directed panels of SNPs with epigenetic potential for future identification of risk mechanisms. Genome wide screening for CpG SNPs within specific sequence contexts that allow or prevent the binding of methyl-CpG binding proteins may provide a useful complement in the search for new risk alleles both in breast cancer and other forms of cancer.
We thank Anders Dahlin (MDCS and MPP), Gudridur Olafsdottir, and Laufey Trygvadottir (Iceland) for data retrieval, Holmfridúr Hilmarsdottir (Iceland), Åsa Ågren (NSHDS), Jolanta Pamula-Pilat, and Karolina Tecza (Poland) for sample retrieval and handling, and Maria Sterner and Liselott Hall at RSKC (Malmö) research facility for technical assistance.