Prediction of colorectal cancer risk based on profiling with common genetic variants

Increasing numbers of common genetic variants associated with colorectal cancer (CRC) have been identified. Our study aimed to determine whether risk prediction based on common genetic variants might enable stratification for CRC risk. Meta‐analysis of 11 genome‐wide association studies comprising 16 871 cases and 26 328 controls was performed to capture CRC susceptibility variants. Genetic prediction models with several candidate polygenic risk scores (PRSs) were generated from Scottish CRC case‐control studies (6478 cases and 11 043 controls) and the score with the best performance was then tested in UK Biobank (UKBB) (4800 cases and 20 287 controls). A weighted PRS of 116 CRC single nucleotide polymorphisms (wPRS116) showed the best predictive performance, with a c‐statistic of 0.60 and an odds ratio (OR) of 1.46 (95% confidence interval [CI] = 1.41‐1.50, per SD increase) in the Scottish data set. The predictive performance of this wPRS116 was validated in the UKBB data set, with a c‐statistic of 0.61 and an OR of 1.49 (95% CI = 1.44‐1.54, per SD increase). Modeling the levels of PRS with age and sex in the general UK population shows that genetic risk profiling can achieve a moderate degree of risk discrimination, which could help to identify a subpopulation with higher CRC risk due to genetic susceptibility.

Incorporating more complete genetic information is expected to improve risk stratification and the combined effect of multiple risk loci has the potential to achieve a degree of risk discrimination that is useful for CRC risk stratification.
In our study, we aimed to derive, optimize and test PRSs for prediction of CRC and to apply the PRSs with the best predictive performance in population settings for risk stratification. We developed models by incorporating genetic information on CRC and on several markers that represent potential CRC risk factors or complex traits co-occurring with CRC. To gauge the broader future potential of genetic risk modeling, we assessed the utility of genetic risk scores in categorizing risk subgroups within the general population by projecting the risk models onto the UK population.

| Studies
We made use of 11 previously published genome-wide association studies (GWASs) (ie, CCRR1 [6], CCFR2 [7], COIN [8], CORSA [9], Croatia [10], DACHS [11], FIN [12], NSCCG-OncoArray [13], SCOT [14], UK1 [15] and VQ58 [16]) to generate a list of genetic variants associated with CRC risk. A series of Scottish CRC case-control studies were used to test the predictive performance of PRSs. The developed PRSs were further evaluated in an independent test data set from UK Biobank (UKBB). A schematic representation of the study design is shown in Figure S1. Standard quality control (QC) measures were applied to each of the data sets. After QC process, a total of 16
We performed a meta-GWAS of 11 studies to obtain a list of genome-wide significant SNPs (P < 5 × 10⁻⁸) and their per-allele odds ratios (ORs) and SEs for CRC risk. The meta-analysis SNPs were pruned to retain only those with an r² < 0.1 and a distance greater than 500 kb. For completeness, we also included the genetic risk variants reported in earlier published CRC GWASs (Table S2). A weighted genome-wide PRS (wPRS) was computed using both previously known susceptibility variants and independent variants identified by the meta-GWAS.
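The weighting step can be illustrated with a minimal sketch. It assumes the conventional construction in which each SNP contributes its risk-allele dosage multiplied by the natural logarithm of its per-allele OR; the text above does not spell out the exact formula, and the function name, ORs and dosages below are hypothetical values chosen only for illustration.

```python
import math

def weighted_prs(dosages, odds_ratios):
    """Weighted PRS for one individual.

    Assumed form: sum over SNPs of ln(per-allele OR) * risk-allele dosage,
    where the dosage is the number of risk alleles carried (0, 1 or 2).
    """
    if len(dosages) != len(odds_ratios):
        raise ValueError("one dosage per SNP is required")
    return sum(math.log(or_) * d for d, or_ in zip(dosages, odds_ratios))

# Hypothetical per-allele ORs for three SNPs (not taken from Table S2)
example_ors = [1.10, 1.05, 1.20]
# Hypothetical risk-allele counts for one individual at those SNPs
example_dosages = [2, 0, 1]

score = weighted_prs(example_dosages, example_ors)
```

In practice the same sum would run over all SNPs retained after pruning, and the resulting scores would be standardized before ORs per SD increase are reported.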

| Regional genetic scores
We additionally constructed regional genetic scores by including SNPs associated with CRC and its risk factors (

What's new
While common genetic variants influence colorectal cancer (CRC) risk, whether these variants can predict high, moderate, or low CRC risk remains uncertain. In this study, the predictive performance of a genomic risk score was compared against a series of regional genetic scores for CRC, with scores developed and tested using data from genome-wide association studies, Scottish case-control studies, and the UK Biobank. A weighted genomic risk score, based on 116 different CRC susceptibility variants, exhibited superior performance over regional scores. The findings suggest that genetic risk assessment could help identify subpopulations with elevated CRC risk linked to genetic susceptibility.

| Model development and evaluation
We constructed prediction models in the Scottish data set by incorporating genetic CRC risk in the form of either PRSs or regional genetic scores, with adjustment for the first 10 genetic principal components (PCs). A sequence of logistic models was fitted for: (a) a weighted PRS of identified CRC GWAS SNPs; (b) regional genetic scores for CRC and (c) regional genetic scores for CRC and other relevant traits. A series of stepwise backward logistic regressions was conducted on regional genetic scores to obtain an optimized set of scores determined by the Akaike information criterion. The discriminatory accuracy of the models was evaluated by the area under the receiver operating characteristic (ROC) curve (also known as the c-statistic) with 10-fold cross-validation. These models were further assessed after stratification by anatomic tumor site (ie, proximal colon, distal colon and rectum). The PRS model with the best performance was then evaluated in UKBB. ORs were then derived per SD increase in PRS for overall and site-specific CRC risk. To simplify the interpretation of PRS, we categorized it into percentiles based on its distribution in controls.
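Two of the quantities above can be sketched directly: the c-statistic, computed here as the proportion of case-control pairs in which the case has the higher score (ties counted as one half), and the percentile of a score relative to the distribution in controls. This is only an illustration under those definitions; it omits model fitting, PC adjustment and cross-validation, and the function names and scores are hypothetical.

```python
from bisect import bisect_right

def c_statistic(case_scores, control_scores):
    """Probability that a randomly chosen case scores higher than a
    randomly chosen control; tied pairs count as one half."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

def control_percentile(score, control_scores):
    """Percentage of controls whose score is at or below `score`."""
    ordered = sorted(control_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)

# Hypothetical PRS values (illustrative only)
cases = [0.9, 1.4, 2.0]
controls = [0.5, 0.9, 1.1, 1.6]

auc = c_statistic(cases, controls)        # 8.5 of 12 pairs favor the case
pct = control_percentile(1.4, controls)   # 3 of 4 controls are at or below 1.4
```

Anchoring the percentile cut-points in controls, as described above, keeps the categories representative of the source population rather than of a case-enriched sample.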

| Combined effect of PRS and family history
To evaluate the incremental contribution of combining PRS and family history for prediction, we additionally calculated the expected information for discrimination (expected weight of evidence, denoted as Λ) [18]. Briefly, the expected information for discrimination is the expected log-likelihood ratio in favor of correct assignment as case or control, taken as the average of the values in cases and controls. One advantage of using Λ is that the contributions of independent variables to predictive performance are additive on the scale of Λ. For a logistic regression model, the sampling distribution of Λ is asymptotically Gaussian. In this situation, the c-statistic can be viewed as a mapping of Λ, which takes values from 0 to infinity, onto the interval from 0.5 to 1 [18]. The rationale and theoretical explanations are presented in Supporting Information Methods. Family history of CRC was treated as a categorical variable, defined by the presence or absence of at least one first-degree relative affected by CRC at any age at the time of recruitment.
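The mapping between Λ and the c-statistic can be sketched numerically. The sketch assumes the Gaussian form c = Φ(√Λ), with Λ expressed in natural-log units and Φ the standard normal distribution function; the derivation itself sits in the Supporting Information and is not reproduced here. The helper converting an OR per SD into Λ further assumes a normally distributed score with equal variance in cases and controls, and the second predictor's Λ is a hypothetical value.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and SD 1

def c_from_lambda(lam_nats):
    """Assumed Gaussian mapping: c = Phi(sqrt(Lambda)), Lambda in nats."""
    return _STD_NORMAL.cdf(math.sqrt(lam_nats))

def lambda_from_c(c):
    """Inverse of the assumed mapping, valid for 0.5 <= c < 1."""
    return _STD_NORMAL.inv_cdf(c) ** 2

def lambda_from_or_per_sd(or_per_sd):
    """Assumption: Gaussian score, equal variance in cases and controls,
    so that Lambda = (ln OR per SD)^2 / 2."""
    return math.log(or_per_sd) ** 2 / 2.0

# Lambda implied by the reported c-statistic of 0.60
lam_prs = lambda_from_c(0.60)

# Additivity on the Lambda scale: add a hypothetical independent predictor
lam_other = 0.02  # illustrative value only, not an estimate from the study
combined_c = c_from_lambda(lam_prs + lam_other)

# Rough consistency check from the reported OR of 1.46 per SD increase
c_from_or = c_from_lambda(lambda_from_or_per_sd(1.46))
```

Because Λ is additive for independent predictors, the gain from adding family history can be read off as a difference in Λ before converting back to the c-statistic scale.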

| Estimation of absolute risk for developing CRC
The absolute risk of CRC for individuals in each risk category was calculated after accounting for competing risks of dying from causes other than CRC by using the formula described previously [19]. Specifically, we obtained sex- and age-dependent UK CRC incidence and mortality rates for midyear 2016 from the Office for National Statistics (http://www.ons.gov.uk/). The mortality rates for non-CRC causes were estimated by subtracting the age- and sex-specific CRC mortality rates from the overall mortality rates. Full details of these calculations are provided in Supporting Information Methods.
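A competing-risk calculation of this kind can be sketched as follows. The exact formula of reference 19 is not given in the text, so the code assumes a common piecewise-constant form: within each one-year interval the CRC hazard is the population incidence rate multiplied by a relative risk for the PRS category, and the hazard of death from other causes removes people from the at-risk pool. All rates, the relative risks and the function names are hypothetical.

```python
import math

def non_crc_mortality(all_cause, crc_mortality):
    """Mortality from causes other than CRC, by yearly age interval."""
    return [a - c for a, c in zip(all_cause, crc_mortality)]

def absolute_risk(incidence, other_mortality, relative_risk):
    """Probability of developing CRC over the supplied yearly intervals,
    allowing for death from other causes (assumed piecewise-constant hazards)."""
    surviving = 1.0  # probability of being alive and CRC-free at interval start
    risk = 0.0
    for inc, mort in zip(incidence, other_mortality):
        hazard_crc = inc * relative_risk
        total = hazard_crc + mort
        if total > 0:
            # chance of any event this year, times the share attributable to CRC
            risk += surviving * (1.0 - math.exp(-total)) * hazard_crc / total
        surviving *= math.exp(-total)
    return risk

# Hypothetical yearly rates over a 10-year window (not ONS figures)
incidence = [0.002] * 10
other = non_crc_mortality([0.015] * 10, [0.001] * 10)

risk_average = absolute_risk(incidence, other, relative_risk=1.0)
risk_high_prs = absolute_risk(incidence, other, relative_risk=2.0)
```

In an applied setting the relative risk for each PRS category would need to be calibrated so that the category-weighted average reproduces the population incidence; that step is omitted here.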

| RESULTS
The meta-analysis of 11 GWASs identified 1593 genetic variants associated with CRC at P < 5 × 10⁻⁸. After adding SNPs reported in other GWASs and excluding SNPs in LD, a list of 116 SNPs (Table S2) was retained for the creation of a weighted polygenic risk score (wPRS116). We additionally created 35 regional genetic scores that included 1593 SNPs with P < 5 × 10⁻⁸ (Table 1).
We also used more liberal P value thresholds and created 40 genetic scores comprising 1837 SNPs at P < 10⁻⁷ and 41 genetic scores comprising 2712 SNPs at P < 10⁻⁶. The genes harbored in these genomic regions were annotated and are presented in Table S3. We additionally created 17 regional scores for CRP, 5 for VD, 85 for IBD, 69 for BMI and 48 for WHR with the P value threshold set at 5 × 10⁻⁸. More liberal P value thresholds (P < 10⁻⁷ and P < 10⁻⁶) were also applied for these traits, and the numbers of regional genetic scores created and SNPs included are presented in Table S4.
We set out to optimize these derived scores by examining their discriminative ability in the Scottish data set (Table S5) (Figure 1; Table S6). When CRC risk was considered separately for proximal colon, distal colon and rectum, predictive performance did not improve. We then explored whether the effect of the wPRS116 was modified by sex, age or family history, but found no evidence of an interaction effect (Table S7; P for multiplicative interaction = .426 with sex, .688 with age and .388 with family history); we therefore did not fit additional interaction terms in the model.

| DISCUSSION
In our study, we describe a systematic approach to derive, validate and test a number of candidate genetic risk scores incorporating information from hundreds to thousands of common genetic variants to predict polygenic susceptibility to CRC. We evaluated the predictive performance of both a genomic risk score and a series of regional genetic scores that were built based on the summary statistics from multiple GWASs. Our study shows that a weighted genomic risk score including 116 CRC susceptibility SNPs is the score with the best performance, while deconstructing genetic risk into multiple regional scores or inclusion of additional SNPs above the genome-wide
With the expectation of improving the predictive power of common genetic variants, we additionally derived a set of SNPs associated with CRC risk at liberal P value thresholds to allow the contribution of signals from additional susceptibility SNPs that have not been discovered or validated in previous GWAS efforts. Any correlation between SNPs was addressed by creating LD-adjusted regional scores. However, with inclusion of thousands of SNPs, the predictive capacity did not improve but showed a lower c-statistic in the range of 0.58 to 0.59, which is probably due to the noise added by SNPs that were not truly associated with CRC. To assess whether the genetic susceptibility to known risk factors of CRC would further contribute to CRC prediction, we developed prediction models that incorporated genetic information on several known risk factors, but the c-statistic remained close to 0.60.
FIGURE 1 Odds ratios and 95% CIs for associations between the percentiles of wPRS116 and site-specific CRC risk in UKBB. CI, confidence interval; CRC, colorectal cancer; UKBB, UK Biobank; wPRS116, weighted polygenic risk score of 116 colorectal cancer single nucleotide polymorphisms
Most previous efforts focused mainly on the ability of PRS to capture the overall risk of CRC [4,5,22-24]. However, there is compelling evidence suggesting that genetic risk factors may differ by anatomic location [25]. We therefore aimed to improve prediction of site-specific CRC by deconstructing the commonly used genomic risk score into several regional scores, allowing susceptibility signals acting through different mechanisms to influence genetic predisposition to site-specific CRC. Although we treated proximal, distal and rectal cancer as distinct endpoints to generate the best set of regional scores for each, their discriminative ability remained modest. This might be because the weights used for regional score calculation were derived from the coefficient estimates for overall CRC rather than from site-specific ones.
An extrapolation to the UK population led to the conclusion that, on the basis of quantifiable genetic risk alone, 10% of the general population will have a 10-year absolute risk approaching 5% after 65 years of age and will merit intensive screening. A 5% threshold of absolute risk has clinical and public health impact since it exceeds the highest risk at any age in the general population and it is 10-fold greater than the risk of a 50-year-old person who is eligible to enter the population-based screening programs. Additionally, the modeling shows that individuals at different levels of the wPRS116 will reach the same risk estimate at different ages, supporting the notion that using genetic profiling in combination with age will lead to more effective risk stratification.
In conclusion, we show that prediction of CRC risk based on profiling with common genetic variants offers moderate discriminative ability. Although the contribution of wPRS116 to individualized risk profiling is limited, genetic risk profiling can achieve a moderate degree of risk discrimination that is helpful for identifying a population subset with high genetic risk.

ACKNOWLEDGMENTS
We are grateful to all who contribute to recruitment, data collection and data curation. We acknowledge that these studies would not be possible without the patients and controls and their families. We acknowledge the expert support on sample preparation from the