The impact of next generation sequencing on the analysis of breast cancer susceptibility: a role for extremely rare genetic variation?


  • FSM Hilbers,

    Corresponding author
    1. Department of Human Genetics, Leiden University Medical Centre, Leiden, The Netherlands
    • Corresponding author: Florentine SM Hilbers, Department of Human Genetics, Leiden University Medical Center, Albinusdreef 2, 2333 ZA Leiden, The Netherlands. Tel.: (+31) 071-5269515; fax: (+31) 071-5268285.


  • MPG Vreeswijk,

    1. Department of Human Genetics, Leiden University Medical Centre, Leiden, The Netherlands
    2. Department of Toxicogenetics, Leiden University Medical Centre, Leiden, The Netherlands
  • CJ van Asperen,

    1. Department of Clinical Genetics, Leiden University Medical Centre, Leiden, The Netherlands
  • P Devilee

    1. Department of Human Genetics, Leiden University Medical Centre, Leiden, The Netherlands
    2. Department of Pathology, Leiden University Medical Centre, Leiden, The Netherlands

  • The authors declare no conflicts of interest.


Women with a family history of breast cancer have an approximately twofold elevated risk of the disease. Even though an array of genes has been associated with breast cancer risk over the past two decades, variants within these genes jointly explain at most 40% of this familial risk. Many explanations for this ‘missing heritability’ have been proposed, including the existence of many very rare variants, interactions between genetic and environmental factors and structural genetic variation. In this review, we discuss how next generation sequencing will teach us more about the genetic architecture of breast cancer, with a specific focus on very rare genetic variants. While such variants potentially explain a substantial proportion of familial breast cancer, assessing the breast cancer risks conferred by them remains challenging, even if this risk is relatively high. To assess more moderate risks, epidemiological approaches will require very large patient cohorts to be genotyped for the variant, which is only achievable through international collaboration. How well we will eventually be able to resolve the missing heritability for breast cancer in a clinically meaningful way depends crucially on the underlying complexity of the genetic architecture.

The genetic landscape of breast cancer

Genetic variation in over 75 loci has been significantly associated with breast cancer risk over the past 20 years, through linkage studies in multiple-case families, genome-wide association scans, or candidate gene mutation scanning. Despite this impressive progress, currently known risk alleles explain only about 40% of familial breast cancer risk [1]. These known risk alleles can be roughly subdivided into three groups based on their relative breast cancer risk and population frequencies. Broadly speaking, these are very rare high risk alleles, common low risk alleles, and an intermediate group of ‘uncommon’ alleles associated with a two- to threefold elevated risk (see Fig. 1 and Box 1 for definitions). The existence and potential impact of a fourth group, rare to very rare low risk alleles, has thus far remained in the realm of speculation, as these are extremely difficult to detect.

Figure 1.

The genetic landscape of breast cancer. This figure shows the allele frequencies and risk distributions for the currently known breast cancer risk alleles. Genes represented by a filled diamond indicate a joint allele frequency, and an average risk associated with all observed alleles/mutations in this gene. Hence there might be variants within this gene that are associated with a much higher or lower risk than this given average, and individual risk alleles will have much lower allele frequencies than suggested by the value on the x-axis. In the case of genes in which the breast cancer associated variants are very rare, an approximation of the risk is given as there is still much uncertainty about these risks. Genes indicated by an open triangle represent a single variant for which the associated breast cancer risk was derived from case–control analyses. In addition to the genes depicted in the graph, there are several others (e.g. XRCC2, RAD51C and BARD1) in which variants have been implicated in breast cancer susceptibility, but for which allele frequency and/or quantitative risk estimates have not yet been established. To avoid misleading suggestions, we have chosen to exclude those from this figure.

BRCA1 and BRCA2 remain the two most significant genes to date. Mutations in these genes confer high breast cancer risks and explain approximately 20% of familial breast cancer [2-4]. The spectrum of mutations found in either gene is extremely complex, with almost 2000 unique mutations (or ‘alleles’) having been documented for each [5]. Interestingly, over half of these have been detected only once so far, and thus have extremely low population frequencies, whereas others have been found recurrently, often within specific ethnic minorities as a result of a founder effect. Well known examples include the BRCA1 c.66_67delAG and BRCA2 c.5946delT mutations, and the BRCA2 c.771_775delTCAAA mutation, which have been detected in 0.5–1% of the general populations of Ashkenazi Jews and Iceland, respectively. The first estimates of breast and ovarian cancer risks conferred by BRCA1 and BRCA2 were linkage-based, i.e. made regardless of mutation type. These analyses indicated that BRCA1 mutations confer a cumulative breast cancer risk of 87% by age 70, and an ovarian cancer risk of 45% by age 70. A wealth of data on mutation carriers has since documented that risk patterns vary significantly for both BRCA1 and BRCA2 mutations depending on their relative position within the gene [6, 7]. These data show that the relation between mutation position and cancer risk is complex, and as such provide an example of what to expect at other susceptibility loci, for which it will be more difficult to obtain such a large dataset. The classical view is that mutant alleles that inactivate BRCA1 or BRCA2 cellular functions, e.g. through premature protein truncation or aberrant mRNA splicing, confer high breast and ovarian cancer risks. Two lessons can be gleaned from the BRCA1/2 case, to be taken into account when trying to interpret very rare variants in other breast cancer susceptibility genes. First, while the majority of mutations in BRCA1 and BRCA2 indeed seem to functionally inactivate the resulting protein, very few analyses have been done, other than for the few well known founder mutations, to actually establish that each unique mutation causes a high risk of breast and ovarian cancer. Instead, the risk associated with very rare mutations has been inferred from the joint analyses of all these alleles in families (also called ‘burden analysis’, see below). Yet, in the daily clinical practice of counseling familial breast cancer, every protein-truncating variant in BRCA1/2 is interpreted as ‘high risk’. Second, several notable deviations from the expected risk patterns have been detected, in that some mutations in BRCA1 or BRCA2 confer either a low (e.g. BRCA2 p.Lys3326*) [8] or a moderately increased risk of breast cancer (e.g. BRCA1 p.Arg1699Gln) [9]. This implies that for many very rare variants for which the effect on protein function is less clear, a causal relation with breast cancer is hard to establish.

Box 1 Rare alleles

The terms rare and very rare are used somewhat arbitrarily in the literature. In this review we use the following alternative allele frequencies as cut-offs:

  • Common (50–5%)
  • Low frequency (5–0.5%)
  • Rare (0.5–0.05%)
  • Very rare (<0.05%)
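The frequency cut-offs of Box 1 can be written down as a small lookup. The following Python sketch is illustrative only: the function name `classify_allele_frequency` is hypothetical, and the treatment of values that fall exactly on a boundary (assigned here to the more frequent class) is an assumption, as Box 1 does not specify it.

```python
def classify_allele_frequency(freq):
    """Return the Box 1 frequency class for an alternative allele frequency.

    `freq` is a fraction between 0 and 0.5 (e.g. 0.0005 for 0.05%).
    Values exactly on a cut-off go to the more frequent class (assumption).
    """
    if not 0.0 <= freq <= 0.5:
        raise ValueError("expected an alternative allele frequency between 0 and 0.5")
    if freq >= 0.05:
        return "common"
    if freq >= 0.005:
        return "low frequency"
    if freq >= 0.0005:
        return "rare"
    return "very rare"
```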

Other genes with high risk breast cancer alleles were discovered because of their strong association with familial cancer syndromes in which breast cancer was one of the defining components, and include the genes TP53 (Li-Fraumeni syndrome) [10], PTEN (Cowden syndrome) [11], STK11 (Peutz-Jeghers syndrome) [12, 13] and CDH1 (hereditary diffuse gastric cancer syndrome) [14-16]. Mutations in these genes are very rare in breast cancer families that do not fit the clinical criteria for these syndromes [17-21], but are associated with a 2- to 10-fold increased risk of breast cancer. Likewise, ATM was an immediate candidate to contain breast cancer susceptibility alleles. Ataxia telangiectasia, caused by ATM mutations, is a recessive disorder, and mothers of patients had already been documented to have an elevated breast cancer risk [22].

Once it became clear that BRCA1 and BRCA2 encode proteins with a role in the DNA damage response to double-strand DNA breaks, and could be connected to a gene network underlying the recessive disorder Fanconi anemia (FA) [23], re-sequencing of other genes constituting these pathways in large patient cohorts resulted in the discovery of a second group of genes in which variants are associated with a more moderately increased risk of breast cancer. This group includes the genes ATM [24], CHEK2 [25], BRIP1 [26], PALB2 [27] and NBS1 [28]. Most variants in these genes are thought to be associated with an approximately twofold increased risk. Some of these variants are relatively common, with allele frequencies in the general population of up to 1%. However, most variants are very rare and their relation with breast cancer could only be established by a burden analysis, pooling very rare variants and comparing their combined frequency between cases and controls. Accordingly, they explain a relatively small proportion of familial breast cancer [27]. Other genes for which associations have been found through candidate gene approaches are FAM175A (Abraxas) [29], BARD1 [30, 31], RAD51C [32], MRE11 [33] and RAD50 [34]. However, evidence for these genes is still limited and sometimes contradictory [35-42]. The last group of genetic variants associated with breast cancer consists of low risk variants of which the minor allele frequency is usually higher than 5%. Currently over 60 of these common low risk variants have been identified through large genome-wide association studies, with the per-allele odds ratios for the risk allele ranging from 1.02 to 1.27 [8]. Interestingly, these low risk variants are usually found outside protein coding regions and most cannot directly be linked to a gene encoding a DNA damage response-related protein.

Missing heritability

Multiple models have been proposed to explain the ‘missing heritability’ in breast cancer. After the discovery of BRCA1 and BRCA2, segregation studies have suggested that a polygenic model most probably explains the majority of remaining familial risk [43-45]. The inability to identify a third ‘BRCA1/2-like’ gene by several genome-wide linkage analyses suggests that additional high risk alleles are probably very rare and scattered across several loci. As a result of recent extreme human population growth, the distribution of variant allele frequencies is strongly skewed towards very rare variants [46]. Conceivably, these variants could confer high, low or moderate breast cancer risks, but genetic drift and selection have not yet had the time to mould their allele frequencies. Jointly therefore, they potentially explain a major part of the polygenic risk of breast cancer. Candidate gene association studies have been underpowered to detect these variants and the single-nucleotide polymorphism (SNP) arrays typically used in genome-wide association studies have been designed to only tag common variation. A specific class of rare variants that might contribute to familial breast cancer risk is that of structural variants, such as large deletions and copy number variations. These variants are not detected by most genotyping methods used in association studies. Instead, sufficiently sensitive detection would require a technique (or combination of techniques) that assesses both break point sequences and copy number variation in a genome-wide manner.

Finally, interactions between genetic variants and/or environmental risk factors remain a largely unexplored area as most case–control studies that could address this are typically underpowered to detect such associations. Even very large cohort-studies such as those undertaken by the Breast Cancer Association Consortium have been able to investigate only two-way interactions. This consortium has now reported a few significant interactions between common low risk variants and environmental risk factors [47]. In addition, many of the common SNPs associated with breast or ovarian cancer risk have been found to be able to modify the risks conferred by BRCA1 or BRCA2, indicating that genetic interactions in fact do exist [48]. To detect additional interactions, (environmental) risk factor data has to be collected for even larger cohorts and rare variants might have to be pooled based on their effect on a specific gene or pathway.

Identifying new breast cancer risk alleles

Although a number of additional breast cancer genes have been identified by candidate gene approaches, an important disadvantage is that this approach is limited by our current understanding of the pathways involved in breast cancer pathogenesis. Next generation sequencing (NGS, Box 2) has brought the opportunity to discover rare alleles associated with breast cancer risk in a more agnostic way by allowing researchers to sequence whole exomes or even whole genomes. However, this approach also comes with challenges. Most studies applying NGS in familial breast cancer cases have taken a whole exome sequencing approach [49-53]. A typical exome sequencing experiment results in tens of thousands of heterozygous variants, which somehow have to be reduced to a number manageable for validation. Many bioinformatics tools exist to assist these complex analyses; however, many different combinations of tools and settings for data analysis and variant selection are reported in the literature. Common strategies include focusing on variants predicted to result in a truncated protein, or on missense changes predicted by in silico algorithms to affect protein function. The removal of variants with an allele frequency of more than 1% (in publicly available databases) is also a common filtering step. However, all these filtering steps come with the risk of discarding a causal variant.
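The filtering strategies described above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline of any of the cited studies: the `Variant` record, its field names, the consequence labels, the decision to count splice-site changes as truncating, and the example gene names are all hypothetical assumptions; only the 1% frequency threshold comes from the text.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    gene: str                 # hypothetical gene label
    consequence: str          # e.g. "nonsense", "frameshift", "missense", "synonymous"
    population_af: float      # allele frequency in a public database, as a fraction
    predicted_damaging: bool  # verdict of an in silico predictor (used for missense only)

# Consequence labels treated as protein-truncating (assumed vocabulary).
TRUNCATING = {"nonsense", "frameshift", "splice_site"}

def passes_filters(variant, max_af=0.01):
    """Keep rare variants that are truncating or predicted-damaging missense."""
    if variant.population_af > max_af:
        return False  # too common in public databases
    if variant.consequence in TRUNCATING:
        return True
    return variant.consequence == "missense" and variant.predicted_damaging

def filter_variants(variants, max_af=0.01):
    """Return the subset of variants that survive all filtering steps."""
    return [v for v in variants if passes_filters(v, max_af)]

# Hypothetical example input.
candidates = [
    Variant("GENE_A", "frameshift", 0.0001, False),
    Variant("GENE_B", "missense", 0.0002, True),
    Variant("GENE_C", "missense", 0.0002, False),
    Variant("GENE_D", "nonsense", 0.03, False),
    Variant("GENE_E", "synonymous", 0.0001, False),
]
kept = filter_variants(candidates)
```

In this toy input only the first two records survive. Every threshold in such a filter is a point at which a truly causal variant could be discarded, for instance a missense change that the predictor misclassifies as benign.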

Box 2 Next generation sequencing

Next generation sequencing is a commonly used term for a set of sequencing techniques that are also known as Massively Parallel Sequencing (MPS). With these techniques many different sequences can be determined in one reaction. For most experiments genomic DNA is randomly fragmented. Subsequently, these fragments can be enriched for all coding regions (exome) or a specific set of disease associated genes (gene panels). These fragments are then loaded on a chip and for each location on this chip, corresponding with a specific fragment, the sequence is determined. For the actual sequencing many different techniques exist, each with their own error rates and artifacts, which have to be taken into account in the analysis. In the end, every position in the genome or region of interest will have been sequenced multiple times. These reads have to be combined into a consensus sequence. The accuracy of this consensus sequence depends on how many times a position is sequenced, also known as the sequencing depth.
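The dependence of consensus accuracy on sequencing depth can be illustrated with a deliberately simplified model. The Python sketch below assumes that each read is wrong at a given position independently with the same probability, that all errors produce the same wrong base (a worst case), and that the consensus is a simple majority vote at a homozygous position; real variant callers, and heterozygous positions in particular, are more involved. The function name is hypothetical.

```python
from math import comb

def majority_error_probability(depth, per_read_error):
    """Probability that more than half of `depth` reads carry the same error.

    For even depths, an exact tie is not counted as a wrong consensus.
    """
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error) ** (depth - k)
        for k in range(depth // 2 + 1, depth + 1)
    )

# Error probability of the majority call at increasing depth, 1% per-read error.
error_by_depth = {d: majority_error_probability(d, 0.01) for d in (1, 3, 11, 31)}
```

Under these assumptions the probability of a wrong majority call falls steeply as depth increases, which is why deeper coverage yields a more reliable consensus.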

Some exome sequencing studies have suggested FANCM [49], BLM [53], FANCC [53] and XRCC2 [52] as potential new breast cancer genes, while others have not reported likely new risk alleles [50, 51]. Of note, none of the exome sequencing studies highlighting new genes provide conclusive evidence for their involvement in breast cancer, but rely on previous data and functional connotation to support their candidacy. FANCC and FANCM are both FA genes and were obvious candidates ever since BRCA2 was found to be an FA gene [54]. An association between FANCC mutations and breast cancer had been reported before [55], but for FANCM the available data are conflicting [56]. Likewise, XRCC2 has been suggested to be an FA gene [57], but the association between XRCC2 variants and familial breast cancer was not detected by two other case–control studies [58, 59]. Mutations in BLM are known to cause Bloom syndrome, a very rare recessive disorder characterized by short stature and a high incidence of multiple cancers [60]. An association between heterozygous BLM mutations and breast cancer has been reported before, although not all truncating mutations seem to be associated with similar risks [61-64].

An interesting alternative to the exome sequencing approach is a study by Ruark et al. [65], who used NGS to analyse a gene panel of 507 genes implicated in DNA damage response to search for new familial breast cancer associated genes. This allowed them to sequence more cases and generate a deeper coverage at each DNA base sequenced. Focusing on protein truncating variants, they reported such variants in the PPM1D gene to be significantly associated with breast and ovarian cancer, but at very rare allele frequencies. Interestingly, all protein truncating variants were mosaic in blood lymphocyte DNA, while functional analyses revealed that the tumorigenic mechanism probably does not comply with that of simple tumor suppressor gene inactivation.

Owing to high sequencing costs only a limited number of samples have been sequenced to date. This hampers selection of variants or genes on the basis of their variant frequencies in cases and controls. When sequencing costs drop further and the number of sequenced samples increases, the full potential of NGS in finding new breast cancer genes will be exploited only if, through international collaboration, large numbers of cases become accessible for sequence comparisons. The discovery of extremely rare breast cancer risk alleles would be particularly enhanced, as no single research centre is predicted to amass sufficient numbers of cases to detect these. However, in order to pool NGS data in an efficient and meaningful manner, consensus needs to be reached on data analysis strategies. At the moment consortia are being formed to pool NGS data in order to increase power [66]. As we can also expect a shift from exome sequencing to whole-genome sequencing, this will give more insight into the role of structural variants and variants in non-protein-coding regions.

Detecting the effect of very rare variants

While NGS will identify many potential variants conferring breast cancer risk, assessing this risk purely on sequencing data usually has insufficient statistical power. Therefore, the next step is usually to perform a case–control study, examining the allele frequency of a specific variant in cases and controls. This strategy can be very successful if the variant is relatively common in the population under study, for example in the case of CHEK2 c.1100delC [25]. However, most potentially causal variants are very rare even among familial cases, making it necessary to genotype a very large number of cases and controls. For example, in order to detect the effect of a single variant with an allele frequency of 0.05% and a relative risk of two with 80% power and an alpha of 5%, at least 22,000 cases and an equal number of controls are needed [67]. In some cases, the efficiency of a case–control study might be improved by selecting subjects from a certain geographical region or with a specific phenotype. For example, mutations in BRCA1, BRCA2 [68, 69], BRIP1 [70] and RAD51C [32] are also associated with an increased risk of ovarian cancer. Selecting cases from families with both breast and ovarian cancer could increase the power of a case–control study assessing variants in these genes.
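The order of magnitude of such sample sizes can be checked with a textbook normal-approximation formula for comparing two proportions. The Python sketch below is not the method of reference [67]; it assumes Hardy-Weinberg proportions to convert allele frequency into carrier frequency among controls, treats the relative risk as an odds ratio (reasonable for a very rare variant), uses a two-sided test, and assumes equal numbers of cases and controls. The function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def carrier_frequency(allele_freq):
    """Fraction of individuals carrying at least one copy (Hardy-Weinberg assumed)."""
    return 1.0 - (1.0 - allele_freq) ** 2

def cases_needed(allele_freq, odds_ratio, alpha=0.05, power=0.80):
    """Approximate number of cases (with as many controls) for a two-sided test."""
    p0 = carrier_frequency(allele_freq)    # carrier frequency in controls
    odds1 = odds_ratio * p0 / (1.0 - p0)   # carrier odds in cases
    p1 = odds1 / (1.0 + odds1)             # carrier frequency in cases
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance = p0 * (1.0 - p0) + p1 * (1.0 - p1)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p0) ** 2

# Allele frequency 0.05%, twofold risk, 80% power, alpha 5%.
n_cases = ceil(cases_needed(0.0005, 2.0))
```

With these inputs the approximation returns a figure in the low twenty-thousands, the same order of magnitude as the estimate quoted above. Because the required number scales with the inverse square of the difference in carrier frequency, halving the allele frequency roughly doubles the number of cases needed.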

A typical ‘burden analysis’ potentially also increases the power of a case–control study. By pooling variants that are likely to affect the function of the gene of interest and comparing the total number of these variants in cases and controls, associations can be found for a gene in which individual variants are too rare for risk assessment. This type of analysis has been used successfully in the case of most genes with moderate risk alleles found through candidate gene studies [24, 26, 27]. However, predicting the functional effect of a variant is not straightforward. Although a number of in silico tools exist that can predict the effect of a variant, these prediction tools might misclassify some missense variants [71]. Erroneous inclusion of neutral variants will dilute the effect of truly pathogenic variants and thus decrease statistical power. In addition, allelic risk heterogeneity, as already set forth above for BRCA1 and BRCA2, complicates the interpretation of the risk estimated from burden analysis, as this risk might not apply to all pooled variants. An extreme example has also been documented for ATM, in which pooled truncating variants are associated with a moderate risk ratio of around 2.3, but in which a single missense variant thought to have a dominant-negative effect on ATM function, p.V2424G, has been found to be associated with a more than 10-fold increased risk [72, 73].
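In its simplest form, a burden analysis reduces to a 2x2 comparison of carriers of any pooled variant in cases versus controls. The Python sketch below uses a one-sided Fisher's exact test for that comparison; this is one common choice and not necessarily the statistic used in the cited studies. The function name and all counts are hypothetical.

```python
from math import comb

def burden_test_p_value(carriers_cases, n_cases, carriers_controls, n_controls):
    """One-sided Fisher's exact p-value for an excess of carriers among cases."""
    total = n_cases + n_controls
    total_carriers = carriers_cases + carriers_controls
    max_in_cases = min(total_carriers, n_cases)
    # Hypergeometric tail: tables with at least as many carriers among cases.
    numerator = sum(
        comb(total_carriers, x) * comb(total - total_carriers, n_cases - x)
        for x in range(carriers_cases, max_in_cases + 1)
    )
    return numerator / comb(total, n_cases)

# Hypothetical pooled counts: 12 carriers in 1000 cases, 3 in 1000 controls.
p_pooled = burden_test_p_value(12, 1000, 3, 1000)

# Adding 20 neutral variant carriers to each group dilutes the signal.
p_diluted = burden_test_p_value(32, 1000, 23, 1000)
```

The second call illustrates the dilution effect described above: the excess of nine carriers among cases is unchanged, but the neutral carriers added to both groups make it harder to distinguish from chance.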

Functional analysis of genetic variants can be very helpful to select variants with functional effect that can be grouped in a burden analysis. For BRCA1 and BRCA2 many such functional assays exist [74]. Also for variants in other breast cancer genes results from functional assays have been reported [75-79]. However, functional data alone are never sufficient to determine if a variant increases breast cancer risk. It is important to keep in mind that most breast cancer associated genes have many cellular functions and not all of these might be equally important for breast cancer risk. In addition, functional assays might result in variants that do not clearly cluster with either neutral or pathogenic variants. The clinical relevance of such variants is hard to determine.

On the basis of the classical two-hit model [80, 81], many studies assess loss of the wild type allele in the tumor when examining a potential breast cancer risk allele. Although loss of the wild type allele seems to occur in most (but not all) tumors in BRCA1/2 mutation carriers [82-85], for other genes loss of the wild type allele is less frequently seen. For example, Goldgar et al. [73] showed that only 1 of 18 breast tumors of carriers of an ATM mutation showed loss of the wild type allele, while 6 showed loss of the mutant allele. Other studies confirm that loss of the wild type allele does not seem to be necessary for carcinogenesis in ATM mutation carriers [72, 86]. Although there is one ATM variant suggested to have a dominant-negative effect [72], other variants might be haploinsufficient. In general, if the wild type allele is lost more often than the mutant allele, this can be regarded as evidence for pathogenicity, but the lack of it provides little evidence against pathogenicity.

Another way of assessing risk is to analyze co-segregation of the variant with breast cancer in the family where it was found [87], which has been applied to very rare variants detected in BRIP1 and PALB2 [26, 27]. However, unless multiple families with the same variant can be analyzed in this way, or the variant shows (near-)perfect co-segregation in a very large family with many cases of breast cancer, this approach is unlikely to provide accurate quantitative estimates of the risk conferred by the variant. Nonetheless, family-based analyses, despite the practical difficulties surrounding the sampling of family-members of the proband, remain attractive because of their better statistical power over case–control analysis in the general population, and because of their immediate relevance for clinical genetic counseling purposes. Here too, international collaboration will be pivotal to enable convincing evidence to be compiled. Specific gene variant databases exist for most genes in which variants have been found to be associated with breast cancer risk. By collating variant data from all over the world, the number of variants for which enough data is available to perform co-segregation analysis will increase. For BRCA1 and BRCA2 the ENIGMA consortium is dedicated to classifying variants of uncertain clinical significance [88]. Similar collaborations will be useful in the classification of variants in other breast cancer associated genes.

How will NGS impact clinical genetic counseling for breast cancer?

For clinical genetic services, NGS offers the possibility to screen additional risk loci at minimal additional costs. Therefore, many centers now consider the use of gene panels containing all known breast cancer associated loci for mutation screening. However, detection of variants in these additional risk loci introduces new challenges [89]. For many very rare variants, or combinations of variants, much uncertainty about the associated risk exists. Therefore, mutational screening of these genes should initially take place in a research setting until these risks can be established with at least some accuracy. A sobering fact is also that even for established high risk genes such as BRCA1 and BRCA2, a variant of uncertain clinical significance (VUS) is uncovered in about 13% of new families tested [90], which corresponds to approximately 40% of families testing ‘positive’. When additional genes are included in the test panel for familial breast cancer, the percentage of families with a VUS in at least one of these genes will increase strongly (Fig. 2). Even though most methods discussed above can be used to assess the risk associated with these VUS, the lack of epidemiological support comparable to that available for BRCA1/2 will strongly limit the clinical utility of these test results.

Figure 2.

Test results from different mutation screening strategies in the clinic. This figure shows the correlation between mutation screening strategy and the distribution of test results. (a) The screened genes for the different mutation screening strategies and the corresponding number of screened coding base pairs. (b) The distribution of test results for BRCA1/2 mutation screening based on Frank et al. [90]. The number of variants of uncertain clinical significance (VUSs) for the gene panel mutation screening was calculated under the assumption that the rate of VUS/base pair for the additional genes would be similar to that of BRCA1 and BRCA2. Note that with the gene panel approach the expected increase of pathogenic mutations will be relatively small compared to the strong expected increase in the number of individuals with a VUS.
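The assumption of a constant VUS rate per base pair can be turned into a rough extrapolation. The Python sketch below is not necessarily the calculation behind Fig. 2: it assumes that VUSs occur independently along the screened sequence (a Poisson model), takes the 13% of families with a VUS in BRCA1/2 from the text, and uses approximate coding lengths for BRCA1 and BRCA2 that are an assumption of this example. The panel sizes are hypothetical.

```python
from math import exp, log

# Approximate coding lengths of BRCA1 and BRCA2 in base pairs (assumption).
BRCA12_CODING_BP = 5592 + 10257
# Fraction of families with a VUS after BRCA1/2 screening, as quoted in the text.
BRCA12_VUS_FRACTION = 0.13

def expected_vus_fraction(panel_coding_bp,
                          ref_bp=BRCA12_CODING_BP,
                          ref_fraction=BRCA12_VUS_FRACTION):
    """Expected fraction of families with at least one VUS in a panel of given size."""
    # Per-base-pair rate that reproduces the reference fraction under a Poisson model.
    rate_per_bp = -log(1.0 - ref_fraction) / ref_bp
    return 1.0 - exp(-rate_per_bp * panel_coding_bp)

# Hypothetical panels of increasing size, in coding base pairs.
by_panel_size = {bp: expected_vus_fraction(bp) for bp in (15849, 50000, 100000)}
```

Under these assumptions the fraction of families with at least one VUS rises quickly with panel size, in line with the trend described in the caption.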

At some point in the not too distant future, DNA diagnostic laboratories might also start to use whole-exome or whole-genome analyses. This is expected to result in large numbers of ‘hard to interpret’ variants. As stipulated above, many potential explanations for the missing heritability of breast cancer still exist, and until we know more about these, it will remain extremely difficult to make clinical inferences on the basis of very rare variant data. On the bright side, however, NGS offers an unprecedented opportunity to get more insight into this genetic architecture. Much of this will initially remain within the realm of research, rather than clinical application, and will strongly depend on international collaboration and data collation in the public domain. Functional laboratory assays should complement genetic epidemiological data, as they did for BRCA1 and BRCA2. Thus clinical genetic testing using NGS could greatly assist the advancement of knowledge on the genetic complexity underlying familial breast cancer. However, genetic counselors should prepare for a situation in which very few of the detected very rare variants in genes other than the high risk genes we know today are ‘actionable’ in terms of disease management.


The authors were supported by the Dutch Cancer Society (grant UL 2009-4388).