Establishing a valid approach for estimating familial risk of cancer explained by common genetic variants

We critically examined existing approaches for the estimation of the excess familial risk of cancer that can be attributed to identified common genetic risk variants and propose an alternative, more straightforward approach for calculating this proportion using well‐established epidemiological methodology. We applied the underlying equations of the traditional approaches and the new epidemiological approach for colorectal cancer (CRC) in a large population‐based case–control study in Germany with 4,447 cases and 3,480 controls, who were recruited from 2003 to 2016 and for whom interview, medical and genomic data were available. Having a family history of CRC (FH) was associated with a 1.77‐fold risk increase in our study population (95% CI 1.52–2.07). Traditional approaches yielded estimates of the FH‐associated risk explained by 97 common genetic variants ranging from 9.6% to 23.1%, depending on various assumptions. Our alternative approach resulted in smaller and more consistent estimates of this proportion, ranging from 5.4% to 14.3%. Commonly employed methods may lead to strongly divergent and possibly exaggerated estimates of excess familial risk of cancer explained by associated known common genetic variants. Our results suggest that familial risk and risk associated with known common genetic variants might reflect two complementary major sources of risk.


Introduction
In the era of genome-wide association studies (GWAS), many single nucleotide polymorphisms (SNPs) have been found to be associated with a higher risk for various types of cancers. [1][2][3] With increasing sample size of the GWAS consortia, the power to detect common variants with small effects has rapidly increased, and discovery of several SNPs at once is now the rule rather than the exception. 4,5 Many of the SNP discovery studies included estimates of how much of the familial risk of cancer can be attributed to previously known and newly discovered genetic variants. They mostly employed an equation first proposed by Cox et al. in 2007, or slight modifications thereof, 6-9 that includes two major components:
1. The relative risk attributable to a given SNP, commonly denoted as
$$\lambda^{*} = \frac{p\,(p r_2 + q r_1)^2 + q\,(p r_1 + q)^2}{\left(p^2 r_2 + 2 p q r_1 + q^2\right)^2},$$
where $p$ is the population frequency of the minor allele, $q = 1 - p$, and $r_1$ and $r_2$ are the relative risks for heterozygotes and rare homozygotes relative to common homozygotes.
2. The overall familial relative risk estimated from epidemiological studies, commonly denoted as $\lambda_o$.
The share of the familial risk attributable to the SNP is then obtained as $\log(\lambda^{*}) / \log(\lambda_o)$. If applied to multiple independent SNPs in a multiplicative model, the numerator consists of the sum of the $\log(\lambda^{*})$ across SNPs.
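The traditional calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the share is read as log(λ*) divided by log(λ_o), the function names are hypothetical, and the allele frequencies and relative risks in the example are invented values, not estimates from any study.

```python
import math


def lambda_star(p, r1, r2):
    """Familial relative risk attributable to one SNP.

    p  : population frequency of the minor allele
    r1 : relative risk for heterozygotes vs. common homozygotes
    r2 : relative risk for rare homozygotes vs. common homozygotes
    """
    q = 1.0 - p
    numerator = p * (p * r2 + q * r1) ** 2 + q * (p * r1 + q) ** 2
    denominator = (p ** 2 * r2 + 2 * p * q * r1 + q ** 2) ** 2
    return numerator / denominator


def proportion_explained(snps, lambda_o):
    """Sum of log(lambda*) across SNPs divided by log(lambda_o).

    Assumes the SNPs act independently in a multiplicative model.
    """
    total = sum(math.log(lambda_star(p, r1, r2)) for p, r1, r2 in snps)
    return total / math.log(lambda_o)


# Hypothetical SNPs: (minor allele frequency, r1, r2); values are invented.
example_snps = [(0.30, 1.10, 1.21), (0.15, 1.08, 1.1664)]
share = proportion_explained(example_snps, lambda_o=2.0)
print(f"share of familial risk explained: {share:.4f}")
```

Because log(λ_o) sits in the denominator, the same set of SNPs yields a different share for every assumed familial relative risk, which is one of the sensitivities discussed in this article.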
Although this approach seems straightforward, there are a number of issues that deserve critical discussion.
• First, the relative risk estimates typically come from the large GWAS that detected the SNPs and will (if winner's curse is not appropriately addressed) typically be lower in independent samples, which may lead to substantial overestimation of the proportion of family history (FH) risk explained by the SNPs.
• Second, if the contribution of multiple SNPs is not completely independent but these SNPs are in some linkage disequilibrium (LD), the numerator of the equation may be substantially overestimated. On the other hand, if too restrictive "inclusion criteria" are employed that would consider only totally independent SNPs, the complementary information that correlated SNPs may still convey would be lost, leading to potential underestimation of the numerator.
• Third, the estimator for relative risk associated with FH in the denominator is commonly taken from pooled estimates from epidemiological studies and may differ from the relative risk in the populations used to derive the genetic risks.
• Fourth, the implicit partitioning of the excess risk of family history into some proportion that is explained by known genetic variants and some proportion that is explained by yet to be identified genetic factors neglects the fact that a substantial proportion of familial aggregation may be due to other reasons, such as familial aggregation of environmental or lifestyle risk factors.
• Fifth, carriers of common, low penetrance risk alleles are not restricted to persons with a FH. In fact, given the limited proportion of persons with a FH, the common, low penetrance risk alleles occur more often in persons without FH. These risk alleles are hence not restricted to familial risk, but in fact convey risk independent of FH.
In this article, we propose an alternative, straightforward "epidemiological approach," using well-established epidemiological methodology that is unaffected by these concerns. We will use colorectal cancer (CRC), the third most common cancer globally, 10 whose heritability was estimated to be 35%, 11 as an example to demonstrate our approach. Among other factors, FH has been identified as a major CRC risk factor. 12 In the past decade, more and more SNPs associated with CRC risk have been discovered by GWAS (e.g. [13][14][15][16]). Risk increases associated with single SNPs are mostly very small, with odds ratios (OR) for the risk alleles ranging between 1 and 1.1, but polygenetic scores based on multiple SNPs were shown to be highly predictive of CRC risk. [17][18][19][20]

Methods

Alternative epidemiological approach
In the proposed approach, both genomic data and familial risk are derived from the same data set. The underlying concept is graphically depicted in Supporting Information Figure S1. The concept reflects the causal role of both genetic and environmental factors for CRC risk. The association of FH with CRC risk reflects clustering of both types of factors within families. Based on this model, we propose to estimate the proportion of the CRC risk associated with having a FH (most commonly defined as FH in a first-degree relative) that can be explained by common genetic variants by
$$\mathrm{Prop(SNPs)} = \frac{RR_b - RR_a}{RR_b - 1},$$
where $RR_a$ is the relative risk (RR) for FH that is adjusted for common genetic variants, $RR_b$ is the RR for FH that is not adjusted for common genetic variants, and both RRs are adjusted for environmental factors affecting CRC risk. Should the genetic variants not explain any of the FH-associated risk, $RR_a$ would equal $RR_b$, in which case Prop(SNPs) would be 0. Should the genetic variants completely (100%) explain the excess CRC risk for a FH beyond the excess risk explained by environmental factors, $RR_a$ would equal 1. In practice, one would expect $RR_a$ to be between 1.0 and $RR_b$ and Prop(SNPs) to be between 0 and 1 (100%). Analogous calculations could be made using the log(RRs) rather than the RRs of both unadjusted and adjusted FH risk estimates (results are presented in the supplement). In case-control studies, RRs are commonly approximated by ORs.
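A minimal sketch of this calculation follows. It assumes the proportion is defined as (RR_b − RR_a)/(RR_b − 1), which reproduces the two limiting cases described above (0 when RR_a equals RR_b, 1 when RR_a equals 1). The log-scale variant is our reading of "analogous calculations" and is an assumption, as are the function names. The unadjusted value 1.77 is the FH estimate reported for this study; the adjusted value 1.70 is purely hypothetical.

```python
import math


def prop_snps(rr_b, rr_a):
    """Proportion of FH-associated excess risk explained by the variants.

    rr_b : RR (or OR) for FH, not adjusted for common genetic variants
    rr_a : RR (or OR) for FH, adjusted for common genetic variants
    """
    if rr_b == 1.0:
        raise ValueError("no excess FH-associated risk to partition (RR_b = 1)")
    return (rr_b - rr_a) / (rr_b - 1.0)


def prop_snps_log(rr_b, rr_a):
    """Same idea on the log scale (assumed form of the analogous calculation)."""
    if rr_b == 1.0:
        raise ValueError("no excess FH-associated risk to partition (RR_b = 1)")
    return (math.log(rr_b) - math.log(rr_a)) / math.log(rr_b)


# 1.77 is the unadjusted OR for FH from the text; 1.70 is a hypothetical
# adjusted OR chosen only to illustrate the arithmetic.
print(f"linear scale: {prop_snps(1.77, 1.70):.3f}")
print(f"log scale:    {prop_snps_log(1.77, 1.70):.3f}")
```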

Study population
Data for the current analyses were taken from the DACHS study (Darmkrebs: Chancen der Verhütung durch Screening), which has been described in detail elsewhere. 21

Data collection
Standardized in-person interviews were conducted with both cases (typically during their hospital stay) and controls (at their homes) by trained interviewers. Detailed information about the participants' family history and a variety of other risk and preventive factors was collected, and blood or buccal samples were taken. All CRC cases were histologically confirmed.

Genotyping
DNA was extracted from blood samples (in 99.1% of participants) or from buccal cells (in 0.9% of participants) using conventional methods. Details about genotyping and imputation are provided in Supporting Information Table S1.

Table notes: 4 FH in at least one FDR. 5 Classification of GRS: very low, ≤10th percentile; low, 10th-20th percentile; low-medium, 20th-40th percentile; medium, 40th-60th percentile; medium-high, 60th-80th percentile; high, 80th-90th percentile; very high, >90th percentile; GRS generated with weighted risk alleles (weights equaling the beta-coefficient as found in the respective discovery study) of SNPs not in LD (cutoff 0.95). Abbreviations: FDR, first-degree relative; FH, family history; SDR, second-degree relative.

Identification and selection of SNPs for the genetic risk score
A literature review was conducted to identify SNPs reported to be associated with a higher CRC risk in persons of European descent, as described in detail elsewhere. 20 Of 105 identified SNPs (Supporting Information Table S2), 97 could be reliably measured or imputed. For six SNPs, the risk allele in our sample was not the same as reported in the respective discovery study (rs1957636, rs72647484, rs7259371, rs2696839, rs11884596, rs2516420). The correlation between the SNPs, measured with the correlation coefficient from the "genetics" package in R, is depicted in Supporting Information Figure S2.

Statistical analyses
A genetic risk score (GRS) was derived with various approaches. It was calculated both unweighted, as the mere sum of risk alleles, and as a weighted sum of risk alleles, with weights equal to the log of the per-risk-allele OR as reported in the discovery study. We further applied various LD thresholds (no LD threshold, D′ ≥ 0.95, D′ ≥ 0.5, D′ ≥ 0.1, or max. one SNP per locus) for inclusion of SNPs in the GRS (among SNPs in LD, only the SNP that was most significant in logistic regression models within our sample was included), which resulted in different numbers of included SNPs (97, 90, 80, 71 and 59, respectively). Additionally, both continuous GRS and categorized GRS were analyzed, with GRS categories defined by percentiles of the GRS distribution in controls: 0-10, 10-20, 20-40, 40-60, 60-80, 80-90, 90-100.
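The GRS construction described above can be illustrated as follows. This is a sketch, not the code used in the study: the function names are hypothetical, genotypes are assumed to be coded as risk allele counts (0, 1 or 2) per SNP, and the percentile cut points use the default interpolation of Python's `statistics.quantiles`, which may differ slightly from the method used by the authors.

```python
import bisect
import math
import statistics


def genetic_risk_score(allele_counts, odds_ratios=None):
    """Unweighted GRS (sum of risk alleles) or weighted GRS.

    allele_counts : risk allele count (0, 1 or 2) for each SNP
    odds_ratios   : per-risk-allele ORs from the discovery studies;
                    if given, each SNP is weighted by log(OR)
    """
    if odds_ratios is None:
        return float(sum(allele_counts))
    if len(odds_ratios) != len(allele_counts):
        raise ValueError("need exactly one OR per SNP")
    return sum(n * math.log(o) for n, o in zip(allele_counts, odds_ratios))


def grs_cutpoints(control_scores):
    """10th, 20th, 40th, 60th, 80th and 90th percentiles among controls."""
    deciles = statistics.quantiles(control_scores, n=10)  # 9 cut points
    return [deciles[i] for i in (0, 1, 3, 5, 7, 8)]


def grs_category(score, cutpoints):
    """Category index: 0 (at or below 10th percentile) to 6 (above 90th)."""
    return bisect.bisect_left(cutpoints, score)


# Invented example: three SNPs with hypothetical per-allele ORs.
print(genetic_risk_score([0, 1, 2]))
print(genetic_risk_score([0, 1, 2], [1.10, 1.20, 1.05]))
```

Deriving the cut points from controls only, and then applying them to cases and controls alike, mirrors the categorization described in the text.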
Proportions of FH risk explained by common genetic variants were estimated according to four traditional approaches, using the formulas given in the respective papers, for four different ORs for FH and for five different inclusion criteria for the common genetic variants under observation. Each calculation was carried out both with relative risk estimates and risk allele frequencies for the genetic variants from the discovery studies and with relative risk estimates and risk allele frequencies from the DACHS study.

Table notes: 1 Estimates based on odds ratios and allele frequencies as observed in the DACHS study. 2 In the DACHS study, 97 out of 105 SNPs could be analyzed (see Supporting Information Table S2). 3 FH estimate obtained in the DACHS study.
Next, we employed the newly suggested epidemiological approach and estimated the proportion of FH risk explained by common genetic variants by
$$\mathrm{Prop(SNPs)} = \frac{OR_b - OR_a}{OR_b - 1},$$
where $OR_a$ and $OR_b$ are the ORs for FH with and without adjustment for common genetic variants, respectively, and both ORs were estimated from multiple logistic regression models as outlined earlier. Different methods of handling the genetic information (GRS vs. separate variables) and, again, different LD thresholds were used for calculating the proportion of explained familial risk. Confidence intervals were computed using bootstrapping methods (n = 1,000).
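The bootstrap step can be sketched as below. Several simplifications are assumptions made for illustration and are not the study's procedure: a percentile bootstrap is used (the text does not specify the bootstrap variant), the unadjusted OR is a crude 2×2 odds ratio, and the adjusted OR is a Mantel-Haenszel odds ratio stratified by GRS category, standing in for the multiple logistic regression models. The counts are invented and all names are hypothetical.

```python
import random


def two_by_two(rows):
    """Count (cases with FH, cases without, controls with FH, controls without)."""
    a = b = c = d = 0
    for is_case, has_fh, _stratum in rows:
        if is_case and has_fh:
            a += 1
        elif is_case:
            b += 1
        elif has_fh:
            c += 1
        else:
            d += 1
    return a, b, c, d


def crude_or(rows):
    a, b, c, d = two_by_two(rows)
    return (a * d) / (b * c)


def mantel_haenszel_or(rows):
    """OR for FH pooled over GRS strata (stand-in for regression adjustment)."""
    strata = {}
    for row in rows:
        strata.setdefault(row[2], []).append(row)
    num = den = 0.0
    for members in strata.values():
        a, b, c, d = two_by_two(members)
        num += a * d / len(members)
        den += b * c / len(members)
    return num / den


def prop_snps(rows):
    or_b, or_a = crude_or(rows), mantel_haenszel_or(rows)
    return (or_b - or_a) / (or_b - 1.0)


def bootstrap_ci(rows, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for statistic(rows)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = rng.choices(rows, k=len(rows))  # resample with replacement
        try:
            estimates.append(statistic(sample))
        except ZeroDivisionError:
            continue  # skip degenerate resamples (empty cell or OR_b = 1)
    if not estimates:
        raise ValueError("statistic failed on every bootstrap sample")
    estimates.sort()
    m = len(estimates)
    lower = estimates[int(alpha / 2 * m)]
    upper = estimates[min(m - 1, int((1 - alpha / 2) * m))]
    return lower, upper


def make_rows(stratum, a, b, c, d):
    """Expand invented 2x2 counts into (is_case, has_fh, stratum) records."""
    return ([(True, True, stratum)] * a + [(True, False, stratum)] * b
            + [(False, True, stratum)] * c + [(False, False, stratum)] * d)


# Invented counts for a low-GRS (0) and a high-GRS (1) stratum.
rows = make_rows(0, 30, 200, 25, 300) + make_rows(1, 50, 250, 20, 180)

if __name__ == "__main__":
    print("Prop(SNPs):", round(prop_snps(rows), 3))
    print("95% CI:", bootstrap_ci(rows, prop_snps, n_boot=1000))
```

Resampling whole participant records, rather than the two ORs separately, keeps the correlation between the adjusted and unadjusted estimates intact, which matters because the proportion is a function of both.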

Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Results
Table 1 shows some main characteristics of the study population used for this analysis. Sex and age were equally distributed among cases and controls, reflecting matching. FH in a first-degree relative was more common in cases than in controls (13.7% vs. 10.0%, p-value <0.0001), and a higher proportion of cases had a higher GRS compared to controls (p-value <0.0001). Figure 1 depicts the distribution of the GRS among cases and controls in relation to FH. A major difference in the distribution of the GRS was seen between cases and controls, but not between participants with and without FH, either in cases or in controls.
Traditional estimates of the proportion of FH-associated risk that can be statistically explained by common genetic variants are shown in Table 2. The estimated proportions ranged widely, depending on the combination of the employed criteria (LD, assumed OR for FH, estimation approach). Estimates of the proportion explained by the genetic variants were generally much lower when they were based on the relative risk estimates and risk allele frequencies from the DACHS study rather than the discovery studies, but substantial variation was observed even within both groups of estimates, ranging from 9.6% to 23.1% and from 14.2% to 42.5%, respectively.
In the DACHS study, the OR for FH derived from a model adjusted for the matching variables sex and age and a number of environmental factors was 1.77 (95% confidence interval (CI) 1.52 to 2.07), and this OR estimate was only slightly altered by additional adjustment for the GRS, regardless of how restrictive inclusion of SNPs was with respect to LD (Table 3). The proportion of FH-associated risk explained by common genetic variants was estimated by the epidemiological approach to be between 6.4% (95% CI: 1.5 to 12.3%) and 8.9% (95% CI: 4.0 to 14.8%). Very similar results were obtained in sensitivity analyses using the log(ORs) instead of the ORs (Supporting Information Table S3).
In the application of our alternative approach, we considered a number of design options, such as (i) adjustment for all common genetic variants as single variables or (ii) adjustment for common genetic variants through one variable, i.e. a GRS. We calculated the FH-associated risk proportion explained by common genetic variants with both approaches, and furthermore with a large variety of GRS model options, and they all consistently yielded estimates in a low range between 5% and 14% (Table 4, Supporting Information Table S3).

Discussion
Our analyses support suggestions from theoretical considerations that traditional approaches commonly employed in GWAS might overestimate the proportions of familial risk that can be explained by common genetic variants, due to the underlying assumptions. Furthermore, estimates obtained by these approaches strongly varied even with relatively minor variations in the inclusion criteria of SNPs. By contrast, our proposed epidemiological approach yielded much lower, but consistent and robust estimates of the familial proportion explained by known common genetic variants.

Table notes: 1 Participants with a FH of CRC in a second-degree relative were excluded for this analysis (n = 588). 2 95% confidence intervals for explained familial risk were calculated using bootstrapping methods (n = 1,000). 3 Regression model 1 adjusted for sex, age, education, smoking, hormone replacement therapy among women, BMI and previous colonoscopy. Regression models 2-6 adjusted for the same variables as model 1 plus additionally adjusted for genetic risk score: 4 continuous, weighted, no LD threshold.

Traditional approaches of calculating the explained proportion of familial risk by common genetic variants have been widely employed in the GWAS literature. 3,5,15,[23][24][25][26] Manuscripts reporting on newly identified susceptibility loci commonly estimate the incremental and total proportion of familial risk explained by the respective SNPs. For example, previously published estimates of the familial risk of CRC allegedly explained by common genetic variants ranged from ~6% for 10 SNPs, 3 to ~8% for 20 SNPs under observation, 15 to ~12% for 76 SNPs, 5 to ~22% 6 for 45 SNPs in a simulation study. Most studies took some LD measure into account, 5,6 while others simply added up contributions of SNPs without LD-pruning. 15 However, as shown by our analyses, such estimates may be quite sensitive to the underlying assumptions. In our example, the estimates obtained by traditional approaches had a wide range. In particular, the following patterns were observed: First, all estimates were substantially higher when the relative risk estimates and the risk allele frequencies of the discovery study rather than those of the DACHS study were employed, which might to some degree reflect the well-known phenomenon of "winner's curse." 27 The obtained relative risk estimates and the risk allele frequencies are subject to variation within populations, which might contribute to differing explained risk proportions.
Second, all of the traditional estimates were strongly dependent on the inclusion criteria of SNPs, being approximately 50% higher for the least restrictive approach (including all 97 SNPs) compared to the most restrictive approach (59 SNPs). Using a less restrictive cutoff might lead to overestimation of the explained familial risk by ignoring the redundancy of the risk information conveyed by SNPs in high linkage disequilibrium.
Third, the traditional estimates strongly depend on the assumed risk increase by family history, a direct consequence of the underlying equations, and may be biased if the risk associated with family history in the analyzed study differs from the assumed familial risk.
Although our suggested alternative approach is less susceptible to violations of these assumptions, possible disadvantages of the "epidemiological approach" also have to be kept in mind. Foremost, it requires availability of epidemiological data, especially information on FH of CRC and possible confounding risk factors. For our approach the risk estimate of having a FH of CRC needs to be obtained from the same data set from which the genetic data are analyzed. Many large genomic data sets might lack this kind of information, but could nevertheless conduct our proposed approach in subsets for which FH information is available. Furthermore, some of the considerations that are mentioned earlier might also apply for our approach, such as the issue of winner's curse if the cohort under investigation was included in deriving the GRS without appropriate correction for overoptimism or the appropriate threshold of the linkage disequilibrium. However, estimates of the familial risk explained by known genetic variants were found to be much less affected by choice of the threshold for the linkage disequilibrium with the proposed epidemiological approach than with the traditional approaches (range for epidemiological approach 5.4-14.3% vs. range for traditional approaches 9.6-23.1%). It has to be kept in mind that our analysis, like most previous analyses that estimated explained familial risk, was restricted to common variants. In principle, however, analogous approaches could also be applied in analyses additionally encompassing rare, highly penetrant germline mutations.

Note: 95% confidence intervals of explained familial risk were calculated using bootstrapping methods (n = 1,000). 1 Categories defined by percentiles of distribution in controls: 0-10, 10-20, 20-40, 40-60, 60-80, 80-90, 90-100. 2 Weights equal to log(OR) from discovery study. Abbreviations: GRS, genetic risk score; OR, odds ratio.
A specific strength of our study was the availability and use of comprehensive genetic and environmental data from the DACHS study, one of the largest case-control studies on CRC, with an unselected population-based multicenter recruitment.
In summary, we suggest an alternative, straightforward approach based on standard epidemiological methodology to estimate the proportion of FH-associated risk of cancer that is attributable to common genetic risk variants. Application of this approach in a large population-based case-control study on CRC supports suggestions that this proportion may be substantially smaller than previously assumed and highly dependent on SNP pruning methods, the assumed risk for having a FH of CRC and the number of identified SNPs. On the other hand, the considerable prevalence of those variants in people both with and without FH implies that, for the example of CRC, polygenetic risk scores based on the common genetic variants identified so far explain quite a substantial proportion of overall risk far beyond FH in the total population. Rather than reflecting a major subcomponent of familial risk, common genetic variants appear to reflect substantial complementary risk. Conversely, rather than exclusively or even primarily reflecting genetic factors, familial aggregation of risks may also reflect to a large extent other components, such as shared environmental or lifestyle factors. 28 While the alternative epidemiological approach was illustrated using CRC as an example, it should be equally applicable to other cancer outcomes, and, in fact, any other disease outcome.