A polygenic biomarker to identify patients with severe hypercholesterolemia of polygenic origin

Abstract
Background: Severe hypercholesterolemia (HC, LDL-C > 4.9 mmol/L) affects over 30 million people worldwide. In this study, we validated a new polygenic risk score (PRS) for LDL-C.
Methods: Summary statistics from the Global Lipid Genetics Consortium and genotype data from two large populations were used.
Results: A 36-SNP PRS was generated using data for 2,197 white Americans. In a replication cohort of 4,787 Finns, the PRS was strongly associated with the LDL-C trait and explained 8% of its variability (p = 10⁻⁴¹). After risk categorization, the risk of having HC was higher in the high- versus low-risk group (RR = 4.17, p < 1 × 10⁻⁷). Compared to a 12-SNP LDL-C raising score (currently used in the United Kingdom), the PRS explained more LDL-C variability (8% vs. 6%). Among Finns with severe HC, 53% (66/124) versus 44% (55/124) were classified as high risk by the PRS and LDL-C raising score, respectively. Moreover, 54% of individuals with severe HC defined as low risk by the LDL-C raising score were reclassified to intermediate or high risk by the new PRS.
Conclusion: The new PRS identifies HC of polygenic origin better than the currently available method and can better stratify patients into diagnostic and therapeutic algorithms.


| INTRODUCTION
Hypercholesterolemia is one of the most common conditions encountered in medical practice, as well as a known and, most crucially, modifiable cardiovascular risk factor. Severe hypercholesterolemia (HC) is defined as low-density lipoprotein cholesterol (LDL-C) > 4.9 mmol/L (> 190 mg/dl) and is estimated to affect 14-35 million people worldwide (Sniderman, Tsimikas, & Fazio, 2014). Familial hypercholesterolemia (FH) is the most common cause of severe HC, with a prevalence of 1 in 250 individuals (Nordestgaard et al., 2013), affecting approximately 10 million individuals worldwide. If untreated, FH is associated with a 20-fold increase in premature cardiovascular disease (CVD), with coronary events occurring in approximately 30% of women before the age of 60 years and 50% of men by the age of 50 years (Nordestgaard et al., 2013). A monogenic origin of FH is confirmed in only 40% of patients with a clinical diagnosis of FH (Sharifi, Futema, Nair, & Humphries, 2017). In more than 90% of these genetically confirmed patients, a pathogenic heterozygous dominant mutation in the LDL receptor gene (LDLR) is detected, with mutations in APOB and PCSK9 present in the remainder (Berberich & Hegele, 2019). In 2013, Talmud et al. developed a weighted 12-single nucleotide polymorphism (SNP) LDL-C raising score, validated in a white British population, and suggested that in > 50% of patients with a clinical diagnosis of FH and negative genetic testing the origin of HC may be polygenic (Futema et al., 2015; Talmud et al., 2013). However, the utility of this score in clinical practice remains to be established, and it is currently not incorporated in the NICE guidelines. In the last decade, several other LDL-C polygenic scores have been proposed (Dron & Hegele, 2018).
However, the major limitation of these studies is that, in most cases, the scores were developed in a specific population and the results were not replicated in a validation cohort.
The clinical management of patients with severe HC remains aggressive lipid-lowering treatment guided by the patient's clinical history (Catapano et al., 2016; Sniderman et al., 2014). However, the finding that severe HC in a large percentage of patients meeting the clinical criteria for FH may be of polygenic rather than monogenic origin raises the question of whether polygenic HC is a different phenotype from monogenic FH, and thus requires different risk stratification algorithms for affected patients and their blood relatives. To answer this question, it is necessary to generate a polygenic biomarker with good accuracy and replicability in identifying patients with severe HC of polygenic origin. Moreover, such a polygenic marker would help identify those patients in whom DNA analysis reveals HC of neither monogenic nor polygenic origin. In such patients, new, yet unidentified genes responsible for FH could be present (Futema, Bourbon, Williams, & Humphries, 2018).
Polygenic risk scores (PRSs) have gained wide interest in recent years as they may help deliver personalized medicine. PRSs have been used to identify patients at risk of several conditions, including cardiovascular disease (Inouye et al., 2018).
The primary aim of this study was to develop an improved polygenic biomarker by generating an LDL-C PRS. The score was derived in a target cohort of white Americans using SNP summary data from the Global Lipid Genetics Consortium (GLGC) and then validated in a second cohort of European Finnish individuals. We also compared the performance of the new PRS against the 12-SNP LDL-C raising score by Talmud et al. (2013) (which is currently used in the UK clinical setting), with a focus on reclassifying individuals with severe HC who were deemed to be at low risk of HC of polygenic origin.

| Ethical compliance
All participants in the NFBC gave written informed consent, and the Ethics Committee of the Northern Ostrobothnia hospital district and the University of Oulu (Finland) approved the study. Protocols for the eMERGE network were approved by the Institutional Review Boards (IRBs) at the institutions where participants were recruited; all included participants provided written informed consent prior to inclusion in the study.

| Populations
A study cohort of 2,764 white American individuals was obtained from the Electronic Medical Records and Genomics network (eMERGE, dbGaP Study Accession: phs000360.v3.p1) (McCarty et al., 2011). A replication cohort of 5,402 Finnish individuals was retrieved from the Northern Finland Birth Cohort 1966 (NFBC1966) (Järvelin et al., 2004; Sabatti et al., 2009). After data quality checking and genotype imputation using the Haplotype Reference Consortium panel (McCarthy et al., 2016), the cohorts comprised 2,197 white Americans and 4,787 Finnish individuals, with 39,131,578 genotyped and imputed SNPs (see details of data preprocessing in Supplementary Methods). Biochemical data were available for all subjects.

| Construction of the PRS
The PRS of an individual j was defined as the weighted sum of LDL-C raising alleles. It depends on the set of n SNPs, the estimated SNP effect sizes (beta coefficients, β_i), and the allele dosage carried by the individual (x_{i,j}), according to the formula:

$$\mathrm{PRS}_j = \sum_{i=1}^{n} \beta_i \, x_{i,j}$$

The PRSice (Euesden, Lewis, & O'Reilly, 2015) algorithm was implemented as follows. First, genome-wide summary statistics for SNPs associated with the LDL-C trait (p < 1 × 10⁻³) were retrieved from the Global Lipid Genetics Consortium (GLGC). This initial set of SNPs was reduced by performing linkage disequilibrium (LD) pruning, thus retaining only the most significant SNPs in each LD block. Different LD thresholds (r² between 0.1 and 0.8) were tested (detailed in Figure S1).
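As a minimal sketch of the weighted sum that defines the PRS, the following Python snippet scores one individual from per-SNP effect sizes and allele dosages, optionally restricted to SNPs below a p-value threshold. The SNP identifiers, effect sizes, p-values, and the `polygenic_score` helper are hypothetical illustrations; they are not values from Table 2 and not the PRSice implementation, and LD pruning is assumed to have been done beforehand.

```python
from typing import Dict, NamedTuple, Sequence


class SnpWeight(NamedTuple):
    rsid: str        # SNP identifier (placeholder names below, not real rsIDs)
    beta: float      # effect size of the LDL-C raising allele
    p_value: float   # association p-value from the summary statistics


def polygenic_score(weights: Sequence[SnpWeight],
                    dosages: Dict[str, float],
                    p_threshold: float = 1.0) -> float:
    """Weighted sum of raising-allele dosages over SNPs with p below the threshold."""
    return sum(w.beta * dosages[w.rsid]
               for w in weights
               if w.p_value < p_threshold)


# Hypothetical summary statistics for three already LD-pruned SNPs.
example_weights = [
    SnpWeight("rsA", 0.20, 1e-30),
    SnpWeight("rsB", 0.05, 1e-4),
    SnpWeight("rsC", 0.10, 1e-25),
]
# Dosage of the raising allele per SNP (0 to 2; imputed dosages can be fractional).
example_dosages = {"rsA": 2.0, "rsB": 1.0, "rsC": 0.5}

score_all = polygenic_score(example_weights, example_dosages)
score_strict = polygenic_score(example_weights, example_dosages, p_threshold=1e-20)
```

Tightening the threshold drops `rsB` from the sum, which is how a p-value threshold changes the set of SNPs contributing to the score.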
Sets of SNPs were defined over a range of p-value thresholds (1 × 10⁻³ to 1 × 10⁻¹⁰⁰) and evaluated by PRSice to identify the best PRS, that is, the one that maximizes the explained phenotypic variance in the white American cohort. At each p-value threshold, the PRS was incorporated in a linear regression model to explain the LDL-C continuous trait, while adjusting for the following covariates: age, gender, BMI, and ancestry differences captured by the first two components from multidimensional scaling. From each regression model, an incremental R² was computed by PRSice and plotted against the p-value threshold. This R² is reported as the difference between the R² of the full regression model (LDL-C ∼ PRS + covariates) and the R² of the null model (LDL-C ∼ covariates). The best PRS was the one achieving the highest incremental R².
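The incremental R² criterion can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the PRSice code: the helper names (`r_squared`, `incremental_r2`, `best_threshold`), the single covariate, and the toy data are all hypothetical, and ordinary least squares is solved through the normal equations for brevity.

```python
from typing import Dict, List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(y: Sequence[float], predictors: Sequence[Sequence[float]]) -> float:
    """R^2 of an ordinary least-squares fit of y on the predictors plus an intercept."""
    rows = [[1.0] + list(p) for p in predictors]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve_linear_system(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def incremental_r2(y, prs, covariates) -> float:
    """R^2 of (LDL-C ~ PRS + covariates) minus R^2 of (LDL-C ~ covariates)."""
    full = r_squared(y, [[s] + list(c) for s, c in zip(prs, covariates)])
    null = r_squared(y, [list(c) for c in covariates])
    return full - null


def best_threshold(y, prs_by_threshold: Dict[str, Sequence[float]], covariates) -> str:
    """Return the p-value threshold whose PRS gives the highest incremental R^2."""
    return max(prs_by_threshold,
               key=lambda t: incremental_r2(y, prs_by_threshold[t], covariates))


# Toy data (hypothetical): one covariate (age) and two candidate scores.
ages = [[30.0], [40.0], [50.0], [60.0], [35.0], [55.0]]
prs_candidates = {
    "1e-20": [0.1, 0.9, 0.3, 0.7, 0.8, 0.2],
    "1e-3": [0.5, 0.5, 0.4, 0.6, 0.5, 0.5],
}
# LDL-C is simulated so that it depends on age and on the "1e-20" score only.
ldl_c = [2.0 + 0.02 * a[0] + s for a, s in zip(ages, prs_candidates["1e-20"])]
chosen = best_threshold(ldl_c, prs_candidates, ages)
```

Because the null model is the same at every threshold, ranking thresholds by incremental R² is equivalent to ranking them by the R² of the full model; reporting the increment simply separates the contribution of the score from that of the covariates.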

| Performance assessment & statistical analysis
The PRS was assessed using the following statistical approaches:
• Model fit: A multiple linear regression model for the LDL-C continuous trait was fitted and the R² of the models compared. These R² values were calculated following the same approach described for PRSice (Euesden et al., 2015) (see Supplementary Material).
• Area under the curve (AUC): The phenotype was categorized into severe HC (LDL-C > 4.9 mmol/L), intermediate HC (3.0 ≤ LDL-C ≤ 4.9 mmol/L), and normal LDL-C levels (LDL-C < 3.0 mmol/L), and the classification accuracy of the scores was assessed by receiver operating characteristic (ROC) curves. We used the DeLong test to compare AUCs from different PRSs.
• The PRS was categorized using the deciles of its distribution: low risk (third decile and below), intermediate risk (fourth to sixth deciles), and high risk (seventh decile and above). This mirrors the score categorization used in the SNP LDL-C raising score by Talmud et al. (2013), thus allowing comparison between methods. Afterwards, the difference in median LDL-C levels across PRS categories was tested using the Wilcoxon test. The risk ratio of having abnormal LDL-C was calculated for the high-risk category relative to the low-risk category (see Supplementary Methods and Table S1). The same cutoffs for risk categorization identified in the American cohort were then applied to the Finnish cohort.
• The distribution of subjects with severe HC was analyzed across PRS categories. We compared the percentage of patients with severe HC in the low-risk PRS category.
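The decile-based categorization and the risk ratio listed above can be sketched as follows. The function names and toy numbers are hypothetical, and the confidence interval uses the standard log-scale (Katz) approximation, which is an assumption here: the text does not state how the reported interval was obtained.

```python
import math
import statistics
from typing import Sequence, Tuple


def risk_cutoffs(discovery_scores: Sequence[float]) -> Tuple[float, float]:
    """Upper bounds of the third and sixth deciles in the discovery cohort."""
    deciles = statistics.quantiles(discovery_scores, n=10)  # nine cut points
    return deciles[2], deciles[5]


def risk_category(score: float, cutoffs: Tuple[float, float]) -> str:
    """Assign low / intermediate / high risk using fixed discovery-cohort cutoffs."""
    low_cut, high_cut = cutoffs
    if score <= low_cut:
        return "low"
    if score <= high_cut:
        return "intermediate"
    return "high"


def risk_ratio(cases_high: int, n_high: int, cases_low: int, n_low: int,
               z: float = 1.96) -> Tuple[float, float, float]:
    """Risk ratio (high vs. low category) with an approximate 95% CI (Katz log method)."""
    rr = (cases_high / n_high) / (cases_low / n_low)
    se_log = math.sqrt(1 / cases_high - 1 / n_high + 1 / cases_low - 1 / n_low)
    return rr, rr * math.exp(-z * se_log), rr * math.exp(z * se_log)


# Hypothetical discovery-cohort scores: cutoffs are derived once ...
discovery = list(range(1, 101))
cutoffs = risk_cutoffs(discovery)
# ... and then applied unchanged to scores from a replication cohort.
replication_categories = [risk_category(s, cutoffs) for s in (10, 45, 90)]

# Hypothetical counts: 20/100 cases in the high-risk and 5/100 in the low-risk group.
rr, ci_low, ci_high = risk_ratio(20, 100, 5, 100)
```

Deriving the cutoffs in one cohort and reusing them in the other keeps the categories comparable across cohorts, at the cost that the replication cohort need not split 30/30/40 across the three categories.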
The list of genes (gene name, OMIM (MIM), and GenBank (RefSeq) identifiers) included in the PRS is presented in Table S2.

| RESULTS
A cohort of 2,197 white American individuals was used to construct the PRS for the LDL-C trait. The clinical and biochemical characteristics of the individuals included in the study are presented in Table 1 and Table S3.
A total of 8,224 SNPs (LDL-C trait association p < 1 × 10⁻³) were analyzed by the PRSice algorithm. In a regression model (see Figure S2), the optimal PRS included 36 SNPs. The LD threshold of r² < 0.1 provided the best model fit (Figures S1 and S2). All 36 SNPs had a reported association with the LDL-C trait with p < 1 × 10⁻²⁰ in the GLGC and are presented in Table 2. This novel score explained 8% of the trait variability (p = 10⁻⁴¹) in a multiple regression analysis adjusted for covariates.

| Replication in the Northern Finland Birth Cohort (NFBC)
The new PRS was applied to a cohort of 4,787 Finnish individuals from the Northern Finland Birth Cohort 1966 (NFBC), whose clinical and biochemical data are presented in Table 1 and Table S3. In this replication cohort, our score explained 8% of LDL-C variability. Moreover, after score categorization, both the median LDL-C level and the risk of having severe HC were significantly higher in the high versus the low genetic risk category (median LDL-C: 3.2 mmol/L vs. 2.6 mmol/L, p = 10⁻⁶³; RR = 4.8 (CI: 2.6-8.9), p = 10⁻⁷).

| Comparison with the currently available method (SNP LDL-C raising score)
We compared the results obtained with our new PRS to those obtained using the SNP LDL-C raising score of Talmud et al. (2013), which was estimated in a white British cohort. Our PRS had a higher AUC than the SNP LDL-C raising score in both the American (AUC = 0.65 vs. 0.61, p = .12, DeLong test) and Finnish populations (AUC = 0.67 vs. 0.65, p = .36, DeLong test), although neither difference reached statistical significance. In relative terms, it explained about 30% more of the trait variance (8% vs. 6% in the American population and 8% vs. 6% in the Finnish population).
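For readers less familiar with the AUC values compared above: the AUC equals the probability that a randomly chosen case has a higher score than a randomly chosen control, with ties counting one half. The sketch below computes it by direct pairwise comparison. The function name and toy scores are hypothetical, and the DeLong test used to compare AUCs is not implemented here.

```python
from typing import Sequence


def auc_from_scores(case_scores: Sequence[float],
                    control_scores: Sequence[float]) -> float:
    """AUC as the fraction of case-control pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical PRS values for individuals with and without severe HC.
toy_cases = [0.8, 0.4]
toy_controls = [0.5, 0.3, 0.1]
toy_auc = auc_from_scores(toy_cases, toy_controls)  # 5 of 6 pairs ranked correctly
```

An AUC of 0.5 corresponds to a score that ranks cases and controls no better than chance, which puts values such as 0.65 and 0.67 in context.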
Afterwards, the categorized PRS and SNP LDL-C raising score were compared. In the white American cohort, 45% (230/506) versus 42% (213/506) of individuals with normal LDL-C levels (< 3 mmol/L) were classified in the low-risk category, and 50% (53/107) versus 46% (49/107) of individuals with severe HC were classified in the high-risk category using the PRS and SNP raising score, respectively.
In the Finnish cohort, which is younger than the American cohort (mean age 31 years, SD 0.2 vs. 60 years, SD 11.5) and has a healthier lipid profile (57% vs. 23% of individuals with LDL-C levels below 3 mmol/L, Table S3), 32% (877/2733) of individuals with normal LDL-C levels were classified as low risk by both methods. However, 53% (66/124) versus 44% (55/124) of individuals with severe HC were included in the high-risk category using the PRS and SNP raising score, respectively (p = .16), suggesting a trend toward better performance of the PRS versus the SNP raising score in the replication cohort.
When results were analyzed by a subject-to-subject comparison, the two methods showed concordance in 37% of cases with severe HC in the American cohort (34 individuals classified as high risk by both methods and six individuals classified as low risk by both methods) and 42% of severe HC cases in the Finnish cohort (44 individuals classified as high risk by both methods and eight individuals classified as low risk by both methods).

[Table 2 notes: … (Sherry et al., 2001). The chromosome harboring the SNP and the SNP alleles (A and B) are shown. Chromosome location is according to GRCh38. Allele B is the LDL-C raising allele. Effect size and p-value for each SNP are according to GLGC. SO is the Sequence Ontology term from Ensembl (Zerbino et al., 2018). Q1 indicates whether the SNP is used in the 12-SNP raising score by Talmud et al. (2013). Table S2.]

… testing to identify novel genetic causes of HC. In our study, 11% (26) of individuals with severe HC (13 in the American and 13 in the Finnish population) were classified as low risk of HC by the SNP LDL-C raising score. However, when using the new PRS, 46% of these individuals (7/13 in the American and 5/13 in the Finnish population) were reclassified to either the intermediate (11 individuals) or high-risk category (1 individual), suggesting that the genetic makeup of these cases can, at least in part, explain their severe HC. Of note, none of the patients classified as low risk by the PRS were classified as high risk by the SNP LDL-C raising score. Six subjects with HC classified as intermediate risk of polygenic origin by the SNP LDL-C raising score were classified as low risk (third decile) by the PRS.
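The concordance and reclassification percentages in this section follow directly from the counts given in the text. The short check below reproduces that arithmetic; the variable names are illustrative, and only counts stated above are used.

```python
def percent(part: int, whole: int) -> float:
    """Express part/whole as a percentage."""
    return 100.0 * part / whole


# Severe HC cases per cohort, as reported in the text.
severe_american = 107
severe_finnish = 124

# Concordance: classified the same way (high/high or low/low) by both methods.
concordance_american = percent(34 + 6, severe_american)
concordance_finnish = percent(44 + 8, severe_finnish)

# Severe HC cases called low risk by the SNP LDL-C raising score (13 per cohort).
low_by_snp_score = 13 + 13
share_low_by_snp_score = percent(low_by_snp_score, severe_american + severe_finnish)

# Of those, the number moved to intermediate or high risk by the new PRS.
reclassified = percent(7 + 5, low_by_snp_score)
```

Rounded to whole numbers, these give 37% and 42% concordance, 11% of severe HC cases called low risk by the SNP LDL-C raising score, and 46% of those reclassified by the PRS.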

| Functional annotations and pleiotropy analysis of genes in the new LDL-C PRS
The 36 SNPs in the new PRS map to 23 genes (Table S2), six of which (PCSK9, APOB, LDLR, APOE, CELSR2, and ABCG8) are also present in the SNP LDL-C raising score (see Table 2). These 23 genes show a significant enrichment in Gene Ontology (GO) terms related to cholesterol homeostasis (p 1 × 10⁻⁴) and lipoprotein processes (p = 7 × 10⁻⁴). No enrichment was found in KEGG metabolic pathway annotations, which suggests that multiple metabolic pathways may be implicated in the development of severe HC of polygenic origin.
In the new PRS, we did not include the two SNPs (rs7412 and rs429358) that define the APOE ε2, ε3, and ε4 haplotypes. However, these two SNPs are in LD with other variants (LD > 0.35 between rs7412 and rs7254892 in TOMM40, and between rs429358 and rs4420638 in APOC4) in both the American and Finnish populations (see Figure S3).

| DISCUSSION
We constructed a new PRS for LDL-C from an initial set of thousands of SNPs at a GWAS p-value threshold < 10⁻³ and demonstrated that it is robustly associated with the LDL-C trait in two independent populations of white European ancestry from the United States of America and Finland. Compared to the existing 12-SNP LDL-C raising score (currently used in the setting of FH genetic testing by genetic laboratories in England, UK), the new PRS was able to explain 30% more of LDL-C trait variability and to identify a polygenic risk component, thereby reclassifying several patients otherwise deemed at low risk of hypercholesterolemia of polygenic origin. The PRS for LDL-C can have several applications in clinical practice, including i) identifying patients with a clinical diagnosis of FH who are at high likelihood of HC of polygenic origin, and ii) inclusion in algorithms for the early stratification of patients at risk of HC and other comorbidities, both cardiovascular disease and Alzheimer's disease, for early lifestyle modifications. The SNP LDL-C raising score by Talmud et al. (2013), and the work published by the same group in patients with a clinical diagnosis of possible FH (Sharifi, Futema, Nair, & Humphries, 2019; Sharifi, Higginson, et al., 2017), suggest that HC of polygenic origin could be a new phenotype, distinct from monogenic FH, thus requiring a different clinical approach. However, the most informative SNP set should be used to identify FH patients without a confirmed monogenic diagnosis (FH/M−) and to distinguish those with polygenic HC from those in whom the genetic background of HC remains unexplained (10%-15% of cases) (Sharifi, Futema, et al., 2017).
The same research group that developed the SNP LDL-C raising score (which consists of 12 SNPs and is currently used by the Bristol genetics laboratories in the UK in the genetic screening of patients with a clinical diagnosis of FH) attempted to improve their score by manually selecting 21 additional SNPs associated with the LDL-C trait and adding them to the original 12-SNP score. However, this did not result in better diagnostic performance (Futema et al., 2015).
Although PRSs can be built using a small number (typically < 100) of SNPs at the genome-wide significance level (p-value < 5 × 10⁻⁸), the field is now moving toward genome-wide polygenic scores consisting of thousands of SNPs with higher p-values (Goldstein, Yang, Salfati, & Assimes, 2015). These large polygenic scores have the potential to be more informative than small ones (Natarajan et al., 2018); however, this comes at the cost of intensive computational analysis, and no genome-wide polygenic score is currently available for LDL-C in clinical practice.
In this study, we used an unbiased method to select the most informative SNPs associated with LDL-C from an initial set of over 8,000 SNPs and to construct a polygenic score for LDL-C. Although the AUC for the PRS was only marginally better than that of the LDL-C SNP score, the PRS was better at classifying patients into low or high risk. However, this was only a trend, possibly because of the small number of patients with severe HC in our two cohorts.
The 23 genes harboring the 36 SNPs selected for the PRS show enrichment in Gene Ontology terms related to lipid metabolism, thus further confirming the validity of the selection process. Pathways analysis did not show any enrichment, which suggests that small defects in multiple metabolic pathways may be involved in hypercholesterolemia of polygenic origin.
We are still far from understanding the genetic causes of FH, a condition associated with a 20-fold increased risk of coronary heart disease (CHD) compared to the general population (Nordestgaard et al., 2013). FH has an estimated prevalence of 1 in 250 individuals. In approximately 40% of FH patients, an inherited pathogenic DNA point mutation is present (monogenic FH, FH/M+) (Sharifi, Futema, et al., 2017), whereas in 50% of cases HC is deemed to be of polygenic origin. In the remaining 10% of cases, FH is of unknown origin. Mutations in as yet unknown genes could be present in these patients and could represent novel drug targets for severe HC. Narrowing down the number of individuals with primary HC and no known pathogenetic cause (HC of neither monogenic nor polygenic origin) is crucial for studies aimed at understanding the pathogenesis of HC. New improved scores, such as ours, that include novel LDL-C SNPs can help identify, and hence reclassify, patients in whom HC of polygenic origin is present, thus improving diagnostic and therapeutic algorithms.
We found that many of the genes included in the PRS have pleiotropic effects. We and others have noted that gene pleiotropy is common in genes implicated in both rare (Ittisoponpisan, Alhuzimi, Sternberg, & David, 2017) and common disorders (Price, Spencer, & Donnelly, 2015). In this study, 47.8% of the genes in the PRS were also associated in GWASs with conditions such as Alzheimer's disease, CHD, and diabetes. Indeed, there is a well-known association between HC and Alzheimer's disease (Park et al., 2013) or CHD. The PRS could therefore help identify HC patients who are at risk of developing comorbidities, contributing toward personalized medicine.
An important limitation of our study is the small number of patients with severe HC in the two cohorts. Future work will involve applying the PRS in FH patient cohorts and, in particular, to FH/M− patients. Moreover, it will be important to evaluate the correlation between polygenic risk for HC, as defined by the PRS, and the risk of cardiovascular events.
In conclusion, we developed a polygenic biomarker based on 36 SNPs that is able to identify patients at an increased risk of HC and associated comorbidities as a result of their genetic makeup.