POLARIS: Polygenic LD‐adjusted risk score approach for set‐based analysis of GWAS data

ABSTRACT Polygenic risk scores (PRSs) are a method to summarize the additive trait variance captured by a set of SNPs, and can increase the power of set‐based analyses by leveraging public genome‐wide association study (GWAS) datasets. PRS aims to assess the genetic liability to some phenotype on the basis of polygenic risk for the same or different phenotype estimated from independent data. We propose the application of PRSs as a set‐based method with an additional component of adjustment for linkage disequilibrium (LD), with potential extension of the PRS approach to analyze biologically meaningful SNP sets. We call this method POLARIS: POlygenic Ld‐Adjusted RIsk Score. POLARIS identifies the LD structure of SNPs using spectral decomposition of the SNP correlation matrix and replaces the individuals' SNP allele counts with LD‐adjusted dosages. Using a raw genotype dataset together with SNP effect sizes from a second independent dataset, POLARIS can be used for set‐based analysis. MAGMA is an alternative set‐based approach employing principal component analysis to account for LD between markers in a raw genotype dataset. We used simulations, both with simple constructed and real LD‐structure, to compare the power of these methods. POLARIS shows more power than MAGMA applied to the raw genotype dataset only, but less or comparable power to combined analysis of both datasets. POLARIS has the advantages that it produces a risk score per person per set using all available SNPs, and aims to increase power by leveraging the effect sizes from the discovery set in a self‐contained test of association in the test dataset.

with a single phenotype, PRS aims to assess the genetic liability to some phenotype on the basis of the polygenic risk for the same or a different phenotype estimated from independent data.
Set-based analysis offers an attractive alternative to single SNP analyses, because the combined effect of SNPs within the set may be captured. Single SNP analyses are often underpowered due to the small effect sizes of individual SNPs, set-based analysis considers the combined effect of all SNPs within the set, which may have a larger combined effect size and hence higher power to detect association than any individual SNP. In addition, gene-based analysis, a gene-centered equivalent of set-based analysis, identifies genes associated with disease rather than a single SNP as a proxy for the gene. In gene-based analyses, genes are found to be fairly consistently associated with disease across different populations. In contrast, different SNPs in a set in linkage disequilibrium (LD) may be found to be associated with a disease in different samples. Gene-based analyses also directly provide information for functional analysis (Li, Gui, Kwan, & Sham, 2011). Set-based analysis can also be employed as a pathway analysis, and applied to sets of SNPs defined by epigenomics for different tissue/cell types.
There are a number of methods to assess the set-based effect by combining the effects of all SNPs within the set, including Fisher's method (Elston, 1991) for combining P values assuming independence between SNPs, Simes's method (Simes, 1986) for finding the smallest adjusted P-value, GATES extended Simes (Li et al., 2011) that incorporates functional information, rank/threshold truncated products of p methods (Dudbridge & Koeleman, 2003;Moskvina et al., 2009;Zaykin, Zhivotovsky, Westfall, & Weir, 2002), set-based analysis as implemented in PLINK (Chang et al., 2015;Purcell et al., 2007), Brown's method (Brown, 1975;Moskvina et al., 2011) that adjusts for the LD structure between SNPs, a logistic kernel-machine based test that accounts for nonlinear SNP effects (Wu et al., 2010), MAGMA (de Leeuw, Mooij, Heskes, & Posthuma, 2015) that uses a regression-based approach and Pascal (Lamparter, Marbach, Rueedi, Kutalik, & Bergmann, 2016) that utilizes the sum and maximum of chi-squared statistics to generate a set score. Each of these methods have advantages and limitations, however, the above methods do not incorporate the effect sizes from external data with individual genotype data. Set-based methods using individual SNP P values are also able to improve power by incorporating external data available from previous studies using metaanalysis. In the present paper, we focus on methods which use individual genotype data as this is a necessary requirement of PRS.
MAGMA v1.06 (de Leeuw et al., 2015) is a recent approach that has emerged as a widely used and computationally efficient set-based method. This regression-based approach accounts for LD between SNPs when individual genotype data are available. The matrix of SNPs within the set is decomposed into principal components (PCs), and PCs with small eigenvalues are removed. The remaining PCs are then used as uncorrelated predictors in regression against the phenotype of interest and an F-test is used to determine the strength of the association between the set and the phenotype, providing the MAGMA set-based P-value. The MAGMA programme can be used on individual genotypes using the Principal Component Analysis (PCA) method and also on summary statistics using Brown's method (Brown, 1975). It often happens that summary statistics are available from a large consortium, while in-house studies with individual genotypes have smaller sample sizes. In such situations, the options for using MAGMA are either applying the PCA method to the in-house genotype dataset, or to a meta-analysis of the summary statistics for both datasets.
PRS analysis can be considered as set-based analysis when a set includes all SNPs in the whole genome. PRSs provide a method for combining information from individual SNPs into a single measure of risk allele burden. In their most widely used form, PRS's have been applied to genome-wide SNP data where they can capture a useful fraction of genetic liability to polygenic traits. PRS's can also be used as genomewide predictors of affected status (Escott-Price et al., 2015;Purcell et al., 2009;Ripke et al., 2014). We reasoned that the basic principles of polygenic score analysis can also be applied to individual genes, or to gene-set analyses. The motivation for doing so is somewhat different than PRS analyses of genome-wide data; rather than predict case-control status or trait liability captured, our goal in applying the principles of PRS to genes and gene sets is to detect association to these potentially biologically informative features. As for genome-wide analyses, genes or gene-set PRS could be used to predict affected status, or to estimate the gene-or pathwayspecific SNP liability captured by GWAS. However, for polygenic disorders where risk is dispersed across hundreds of genes and multiple gene sets, self-evidently, gene-or pathwayspecific SNP liability will be lower than the liability captured by genome-wide data, and accordingly, such tests will afford less case-control discriminatory power. Risk scores for each individual per set can be used to stratify individuals for follow-up studies and prioritize genes for further functional studies.
In this paper, we suggest a novel approach to a set-based framework that combines advantages of MAGMA's PCA method and PRS. The proposed POlygenic LD-Adjusted RIsk Score (POLARIS) method aims to improve upon the standard PRS method by correcting the inflated Type I error observed both in standard PRS in the presence of LD (Chatterjee et al., 2013), and also in set-based analyses as the number of SNPs in the set grows (de Leeuw, Neale, Heskes, & Posthuma, 2016). We use spectral decomposition of the SNP correlation matrix to adjust the individuals' allele counts for LD structure. In this paper, POLARIS is presented as a self-contained setbased approach in that it compares the test statistic for the set with the null hypothesis, rather than a competitive approach that accounts for the baseline level of association across the genome. However, it can be turned into a competitive method either by including a general PRS in the analysis, or comparing the set-based PRS to those generated from random sets of genes (matched for number of SNP-sets/set size/numbers of SNPs).
POLARIS informs the analysis with previously reported effect sizes of the SNPs' association with disease. An LDadjusted PRS is calculated per person per set, and the overall set effect is computed using regression. Because the score is used as a predictor in a regression analysis, it is possible to include further population covariates or any other possible confounders. POLARIS uses all available information, because all PCs are incorporated into a score, thus avoiding overfitting that may result from only including the top PCs. As in standard PRS analysis, only one independent variable (apart from extra covariates) is present in the regression model, rather than the number of predictors being equal to the number of markers, or the number of chosen PCs. Like the standard PRS approach, an advantage of our method is that it performs a self-contained test of association in the test dataset, leveraging the discovery set to increase the power of this test. A significant test statistic implies significant association specifically in the test sample, unlike a significant metaanalysis result, where the association evidence could result from other samples. This might be important if the test sample is of specific interest, for example, a different ethnicity, or a different, but related, phenotype.
In the present paper, POLARIS is evaluated by comparing the set-based results calculated using LD-adjusted PRS with those found using MAGMA (de Leeuw et al., 2015) on simulated data, both with a simple constructed LD structure and real data LD pattern.
The POLARIS set-based analysis tool is available to download from github.com/BakerEA/POLARIS. The tool is written in Python and will operate on any computing platform.

For
SNPs in a set, the standard PRS combines single-SNP genotypes ( = 1, … , ) into a single regression predictor using single-SNP effect sizes (log(OR ) = ) taken from a previous study as coefficients, This method implements a two-stage approach, where independent discovery and test sets are available. The effect sizes are determined from the discovery set and vector of the number of risk alleles is obtained from the test set. The underlying assumption is that individual genotypes are available for the test set, but only summary data (effect sizes ) for the discovery set are available.
The standard PRS method does not adjust for LD between markers and thus requires LD pruning (Chatterjee et al., 2013). If markers are in LD, the simple weighted sum (Equation 1) may give them undue weight; indeed, if they are in positive LD, they are likely to have a similar single-SNP effect size and act together, thus giving a larger contribution to the PRS than a single or uncorrelated marker.
We correct for this imbalance due to LD by replacing the vector of genotypes with a vector̃of adjusted dosages. Consider the spectral decomposition of the × markermarker correlation matrix , with eigenvalues satisfying ∑ =1 = tr = and orthonormal (column) eigenvectors . The correlation matrix is the covariance matrix of the joint distribution of individual genotypes after standardization of each SNP. Its eigenvectors indicate the directions of the principal axes of this standardized distribution, and the corresponding eigenvalues give the variances of the distribution in the corresponding directions. In the absence of LD, these variances will be equal to 1, and the distribution will be isotropic. However, if there is LD, then these variances will in general be different, and the standardized distribution will be more elongated in some principal directions and flattened in others.
This anisotropy can be removed by scaling the standardized joint distribution in the direction of each principal axis with the inverse square root of the eigenvalue in this direction. However, adjusting the standardized distribution in this way will not only remove LD, but also equalize the single marker variances, thus discarding information such as the minor allele frequencies. As our aim is to adjust for LD only, but not for single-SNP variances, we therefore have chosen to apply the same scaling transformation to the original, unstandardized joint distribution instead.
More specifically, due to the orthonormality of the eigenvectors, the PRS can be expressed in a spectral decomposition The component , which is the part of along the th principal axis, has correlation matrix eigenvalue and therefore contributes a disproportionate amount of variance to PRS unless ≈ 1. For an uncorrelated marker, one spectral component will be concentrated on this marker, and the corresponding eigenvalue ≈ 1. For our adjustment, we rescale the coordinate of in the direction of the th principal axis, , with the inverse square root of the correlation eigenvalue, giving an adjusted coordinate 1 √ , and hence the rescaled spectral compo- Applying this adjustment to each principal axis will result in an isotropic distribution in which the correlation has mostly been removed, for the adjusted dosage vectors Note this adjustment of multivariate data by correlation is analogous to the calculation of the Mahalanobis distance for mean zero data , −1 = ‖ − 1 2 ‖ 2 , where is the covariance matrix (Mahalanobis, 1936, see also Hotelling, 1931), except that here we use the correlation matrix instead of the covariance matrix in order to avoid adjusting for singlemarker variance.
Using the adjusted dosages̃instead of the original genotype vectors , we obtain an LD-adjusted PRS In the sum over the spectral components, indexed by , the terms with = 0, corresponding to principal directions with no variance, are to be omitted, resulting effectively in a pseudoinverse of the square root of . In cases of extreme LD, where ≈ 0, this formula will apply a large correction factor to the corresponding component, thus possibly amplifying small deviations due to, for example, genotyping error. In order to avoid this instability, we introduce a ridge parameter 0 , for which we suggest the choice 0 = √ 1 , where is the number of individuals in the test data, and modify the adjustment to mitigate the effect of small . This gives rise to the POLARIS risk score, where now̃= , and is the × unit matrix. Note that if all markers are uncorrelated, then ≈ 1 for all , which makes̃≈ , and consequently POLARIS ≈ . We remark further that we applied the adjustment̃= − 1 2 (or the extension with a ridge parameter) directly to the vector of genotypes. More precisely, an adjustment of the variance only will be achieved by removing the sample mean vector̂before the adjustment, giving however, this only amounts to shifting the POLARIS score by a constant ( − − 1 2 )̂, which is irrelevant in the subsequent regression analysis.

POLARIS: Set-based analysis applied to simulated data
To understand detailed differences and similarities between MAGMA (de Leeuw et al., 2015) and POLARIS, we tested both methods on simulated data, both with a simple extreme LD pattern and a real-data LD pattern between SNPs. We tested Type I and II errors by simulating null effects and introducing some association to the SNPs, respectively. We ran a set of experiments to compare the proposed POLARIS method to MAGMA.
To generate summary statistic data and genotype data, a simulated dataset was randomly split into discovery and test sets. The summary statistics for each SNP in the discovery set were computed. The following different scenarios were simulated.
• Scenario A ("One LD Block"): 10 SNPs in an LD Block with (i) 2 = 0.2 and (ii) 2 = 0.8 between consecutive SNPs. The "causal" SNP is associated with disease with OR=1.1 and the remaining nine SNPs have an OR closer to the null value of 1. An additional 90 independent unassociated SNPs are also present in the set, see supplementary Figure S1 for LD structure.
• Scenario B ("Real Data LD"): 115 SNPs from real genetic and environmental risk in Alzheimer's disease (GERAD) data (Harold et al., 2009), see the next section for a detailed description of the data and supplementary Figure S2 for LD structure. For an SNP in a block of strong LD, a number of controls who were homozygous for the risk allele were set to cases, and an equal number of cases homozygous for the protective allele were set to controls, thus producing an association with disease.
For these scenarios, the sample size of the discovery dataset was varied in order to determine the influence of the discovery set sample size on the POLARIS method. For Scenario A, simulations were run creating data with N = 20,000 and 60,000 individuals, these were split equally to result in a test and discovery set each with N = 10,000 and 30,000 subjects, respectively. Additionally, the larger set with 60,000 individuals was split such that the test set had N = 10,000 and the discovery set had N = 50,000 individuals. Scenario B has 13,164 subjects for the combined discovery and test sets; we divided these data 50/50 and 25/75. In both the discovery and test datasets, 30% of the sample size were cases.
A total of 1,000 simulations were performed for each scenario. The power to detect the association between the set and disease is calculated as the proportion of P values from the 1,000 simulations that were below a given P-value threshold; the P-value thresholds used were P = 0.05, 0.01, and 0.001. Ten thousand simulations were used for the real data simulations, thus enabling a more stringent threshold of 0.0001 to be considered. The power of the POLARIS method, applied to the test dataset and informed by the discovery dataset, was compared to the power of MAGMA both applied to the test dataset only and to the total unsplit data samples.

POLARIS: Gene-based analysis applied to real data
Both POLARIS and MAGMA were applied to genotyped AD data to determine gene-wide P values. The GERAD (Harold et al., 2009) genome-wide association study (GWAS) data (3,332 cases, 9,832 controls) were used as the test dataset. International Genomics of Alzheimer's Project (IGAP) (Lambert et al., 2013) data (17,008 cases, 37,154 controls) excluding GERAD subjects (IGAP-noGERAD) were used as the discovery data in order to inform POLARIS with association effect sizes.
International Genomics of Alzheimer's Project (IGAP) is a large study that used genotyped and imputed data on 7,055,881 single nucleotide polymorphisms (SNPs) to metaanalyze four previously published GWAS datasets consisting of 17,008 AD cases and 37,154 controls (The Genetic and Environmental Risk in AD consortium-GERAD, The European Alzheimer's disease Initiative-EADI, the Alzheimer Disease Genetics Consortium-ADGC, and The Cohorts for Heart and Aging Research in Genomic Epidemiology consortium-CHARGE).
For this study, we used only directly genotyped SNPs from the GERAD data (cf. Escott-Price et al., 2014, where however, imputed genotype data were used for IGAP summary statistics analysis). The GERAD and IGAP-noGERAD datasets have 419,048 SNPs in common. It was necessary to ensure that SNP alleles were coded in the same direction across both the discovery (IGAP-noGERAD) and test (GERAD) datasets. If alleles in IGAP-noGERAD were coded in the opposite direction to those in GERAD, the summary effect size for the SNP was inverted. SNPs with alleles AT, TA, CG, or GC were excluded because the direction of the effect could not always be determined when combining two studies. Of the SNPs in IGAP-noGERAD, 103,356 matched those in GERAD, the remaining had effect sizes inverted and no SNPs were excluded due to ambiguity. An MAF filter of 0.01 was applied to the data.
The missing genotypes in real data were imputed as in PLINK (Chang et al., 2015;Purcell et al., 2007), where missing genotypes are substituted by 2 × for each SNP. In the GERAD data, 0.0514% of genotypes required imputation.
SNPs were assigned to genes using GENCODE (v19) gene models (Harrow et al., 2012). Only genes with known gene status and those marked as protein coding were used. No window was used around the gene, only SNPs within the start and end position of the gene were included. SNPs that belong to multiple genes were assigned to all those genes. A total of 202,504 SNPs were assigned to 14,620 distinct genes with a maximum of 1,342 SNPs in a gene.
The results of gene-based analyses for Alzheimer's disease (AD) data using POLARIS were compared to those from the MAGMA-PCA approach in GERAD genotype data and also the MAGMA-SUMMARY approach in IGAP data (not MAGMA-PCA, as the individual genotypes for the whole IGAP data were not available to us). For the latter, we only consider SNPs present in both IGAP and GERAD. Prior to the gene-set analysis, SNP summary statistics for the whole IGAP data were adjusted for the genomic control parameter, =1.087, as reported in Escott-Price et al. (2014).

Type I error
We investigated the Type I error rates in simulations where none of the SNPs have an association to disease (i.e., OR = 1) in either the discovery or test sets, for Scenarios A (one LD block) and B (real data LD), termed A(null) and B(null), respectively. Type I error is deemed acceptable if the nominal value is included in the 95% CI for estimated Type I error rate. The expected Type I error is displayed on the Type I error plots (gray dashed line). Figure 1 shows Scenario A(null). The LD structure for Scenario A can be seen in supplementary Figure S1. Type I error for POLARIS is shown by the blue bars (POLARIS), the red bars (MAGMA D&T Geno) display the Type I error for the MAGMA method in the combined discovery and test individual genotype data and green bars (MAGMA T Geno) show the Type I error for MAGMA in test set genotype data only. In supplementary Figure S3, we additionally show the F I G U R E 1 Type I error comparison of set-based methods at different P-value thresholds; scenario A(null)-simulation of 10 SNPs in LD and 90 independent SNPs Notes: POLARIS with no associated SNPs in both test and discovery sets (blue), MAGMA in the test set only (green), and MAGMA in combined test and discovery sets (red). Expected Type I error is shown by the gray dashed line.
Type I error rate for summary statistics based analysis using MAGMA (MAGMA D&T Summ).
The Type I error rate is reasonable in the majority of cases; the nominal value is included in the 95% CI. MAGMA in the combined data has slightly inflated Type I error at a P-value threshold of 0.001 when the test and discovery set have N = 10000 and 2 = 0.2.
The Type I error for scenario B(null), with the case-control status randomly permuted in order to remove the effect size of any SNPs, is shown in Figure 2 and supplementary Figure S4. The LD structure for this scenario can be observed in supplementary Figure S2. The Type I error rate is within 95% CIs of expected values for most cases, but is somewhat inflated for the summary statistic based analysis using MAGMA.

Power
The power of the POLARIS method (blue bars, POLARIS), MAGMA in the test set only (green bars, MAGMA T Geno) and MAGMA in the combined discovery and test sets (red bars, MAGMA D&T Geno) are displayed for each simulated scenario. The power graphs for Scenario A are shown in Figure 3. The 10 SNPs that are in LD are associated with disease with OR = 1.1.
POLARIS has equivalent power compared with MAGMA in the combined discovery and test sets in all cases. In the most likely realistic situation, the discovery set is larger than the test set, but only summary statistics are available for the discovery set. MAGMA on the combined dataset has higher power where the test N = 10,000 and discovery N = 50,000, but here MAGMA is applied to the individual genotypes of the discovery and test sets combined (N = 60,000), so the sample used to estimate LD and perform the statistical test is very large, whereas POLARIS uses the discovery set N = 50,000 for effect size estimation and only N = 10,000 for LD estimation, and importantly, for statistical testing. In all cases, POLARIS has higher power than MAGMA in the test set only, as is expected, because POLARIS increases power by incorporating information from the discovery set. The power for POLARIS increases when the size of the test set increases, as this improves the estimate of LD between markers. Figure 4 shows the power graph for Scenario B. The power of the POLARIS method lies generally between the power of MAGMA applied to the test set only and MAGMA in the combined test and discovery sets. One can see that by using the information from the discovery set, POLARIS increases the power compared to using the test set only, but, as is to be expected, not as much as using the individual genotypes from the discovery set as well as the test set.
These power results were also compared with the power of the summary statistics based approach implemented in MAGMA, see supplementary Figures S5 and S6. Note that the power of the summary statistics based approach (MAGMA F I G U R E 2 Type I error comparison of set-based methods; scenario B(null)-simulation of 115 SNPs from real data, with permuted phenotypes to remove effect sizes Notes: POLARIS with no associated SNPs in either test or discovery sets (blue), MAGMA in test set only (green), and MAGMA in combined test and discovery sets (red) are compared. Expected Type I error is shown by the gray dashed line.

F I G U R E 3
Power comparison of set-based methods at different P-value thresholds; scenario A-simulation of 10 SNPs in LD with OR=1.1 and 90 independent, unassociated SNPs Notes: POLARIS (blue), MAGMA in the test set only (green), and MAGMA in combined test and discovery sets (red) are compared.  D&T Summ) exceeds the power of MAGMA PCA (MAGMA D&T Geno) on the same combined dataset. Table 1 demonstrates the number and proportion of genes below a particular P-value threshold for POLARIS, MAGMA-PCA in GERAD genotype data and MAGMA-SUMMARY in IGAP summary statistic data. Statistically independent associations in some instances implicate overlapping regions. To define genes as physically independent, we have annealed associated genes that were not separated by at least 250 kb in each analysis separately. In the APOE region, significant genes on chromosome 19 between 44.4 and 46.5 Mb were counted as one. We also present the results for all genes in supplementary Table S1. The number of independent significant genes for all Pvalue thresholds is higher or equal for POLARIS compared to the MAGMA-PCA approach in GERAD data. This is expected as POLARIS uses both GERAD and IGAP-noGERAD data, while MAGMA-PCA uses GERAD genotypes only. The results for the summary statistic approach show higher numbers of significant genes for higher significance thresholds. The five gene-wide significant genes found by the summary statistics approach are: TOMM40, CLU, BIN1, MS4A4E, and CR1, which have all been previously reported as being associated with AD from single SNP analyses (Harold et al., 2009 andLambert et al., 2013). For these five genes, POLARIS also finds an association, but does not always reach gene-wide significance ( = 6.33 × 10 −24 , = 7.17 × 10 −6 , 0.00112, 0.00108, and 0.00065, respectively).

Gene-based analysis: Application to real data
The issue of bias in the estimation of set-based P values caused by gene size is insufficiently tackled by most available methods (de Leeuw et al., 2016;Ruderfer, 2013). Larger genes harbor a larger number of SNPs, and if each SNP has a small inflation in P-value due to, for example, unaccounted stratification, then these large genes will show greater accumulated inflation. To assess whether this is an issue in POLARIS, the phenotypes in GERAD data were permuted to create 1,000 simulations, and for each gene, the empirical P-value (the proportion of P values less than 0.05) was computed. The correlation between the number of SNPs per gene and the empirical P value of each gene in AD data was then determined. We found no evidence (r = 0.0009, P = 0.9096) of a correlation between the number of SNPs in a gene and the gene P-value for the POLARIS method. Therefore, associations with disease observed in larger genes is not simply due to a greater number of SNPs in the gene. Similarly, we observed no evidence (r = -0.00321, P = 0.6977) of an inflation in P-value for increasing gene size using MAGMA-PCA on GERAD data. When considering the correlation between the IGAP genebased P-value and set size, we observe a statistically significant negative correlation (r = -0.083, P< 2.2 × 10 −16 ) when MAGMA-SUMMARY is used on summary data, indicating that the higher the number of SNPs, the lower the set-based P-value.

DISCUSSION
In this paper, we present a method for accounting for LD in the calculation of a PRS. The resulting individual LDadjusted PRS can also be used for analyzing whether a set of SNPs is associated with disease. This method combines the advantages of PRS and spectral analysis of the genetic data. The latter suggests a mathematically sound adjustment for LD and includes a stabilization parameter (similar to ridge regression) to cope with cases of extreme LD. It adjusts for LD between SNPs and informs the analysis with previously reported effect sizes of a SNP's association with disease. In the present study, we have chosen to do this adjustment using the SNP-SNP correlation matrix; however, one could alternatively use the SNP-SNP covariance matrix. For all examples above, this gives very similar results. Partitioning the overall polygenic risk based on meaningful SNP sets, the method allows both to test for significance of association of these sets (set-based analysis) and to provide individual set-specific risk scores for subjects, which can further be used for risk prediction of subphenotypes with respect to the SNP sets.
To assess the quality of the proposed approach, we compare its use for set-based analysis with the widely used MAGMA software. We show that POLARIS gives the correct Type I error and its power lies between that of MAGMA applied to the test dataset only and MAGMA applied to the combined test and discovery datasets. In practice, researchers would use all the available genotype data, and would use PRS-based methods only if effect sizes only are known for an additional dataset.
POLARIS has four main advantages. (1) It produces a risk score per person per set, unlike other set-based methods which only provide a P-value for the strength of association between the set and disease. This set risk score can be used to stratify individuals for follow up studies (e.g., clinical trials) and also prioritize genes for further functional studies (e.g., animal models), supporting the development of precision medicines.
(2) POLARIS can increase power by leveraging the discovery set to perform a self-contained test of association in the test dataset. Another way to incorporate the discovery set would be to use meta-analysis, however, this detects an association in the combined set rather than the test set only. This may be important when the test data differs in some way from the discovery data, for example different ethnicity, or different phenotype. A good example might be where the test sample uses different diagnostic criteria to measure the same phenotype (e.g., self-report questionnaire for depression) and one wishes to validate these criteria by showing that they show association to the same genes as those implicated by the standard diagnosis. (3) POLARIS is not inflated by set size. This is an issue previously reported in summary statistic based approaches (Chatterjee et al., 2013;Ruderfer, 2013) and is shown to be the case using MAGMA on the IGAP summary statistic data. (4) The overall set association can easily be adjusted by population or any other covariates.
For set-based analyses, in situations where only summary statistics are available for part of the data, we suggest to use the POLARIS method which unites advantages of the PRS approach and the PCA-based set-based method, while taking into account the LD structure for the genetic region of interest.
POLARIS can be used for any set of SNPs, for example, the whole genome, genes, or pathways. Therefore, it has the potential of data-driven discovery of pathways.
POLARIS can also be utilized in a number of crossdisorder analyses to determine commonality between disorders at a gene-based or pathway-based level. There are a number of common disorders for which the GWAS summary data are publically available (e.g., Psychiatric Genomics Consortium). The GWAS data for one disorder can be used to generate scores per person per gene in another disorder or subphenotype of interest, and thus test for overlap between disorders at a gene-based level.
The POLARIS method can be extended to add additional information into the score, such as rare variants from exome sequencing studies. The POLARIS set-based method is implemented into a freely accessible platform independent software. Large sets have a high computational burden due to the spectral decomposition of large correlation matrices; recommendations on maximum set size and the corresponding required computational resource are included with the software.
In this study, POLARIS was applied to test binary traits. However, the POLARIS score can also be used as a variable (along with other covariates) in regression models for quantitative traits.
A limitation of the POLARIS implementation is that currently it is only available as a self-contained set-based method. However, POLARIS can in principle be used as a competitive set analysis, adjusting for the baseline level of association in the data either by including a general PRS in the analysis or comparing the set-based PRS to those generated from random sets of genes (matched for number of genes/gene size/numbers of SNPs) or random sets of SNPs (matched for LD, MAF, and SNP density).
Another limitation of PRS-type approaches is the imperfect tagging of the underlying causal variants by SNPs and imperfect effect size estimates. The challenge of selecting the true set of susceptibility SNPs for PRS modeling to capture heritability has been pointed out (Chatterjee, Shi, & Garcia-Closas, 2016). Our approach can use all SNPs in a set of interest, even when in LD, and therefore any causal genotyped SNPs will be included. If the causal SNPs are not present in the sample, then the tagging SNPs only are used. The effect sizes of all SNPs in LD will be adjusted according to the LD structure, not according to the causal/noncausal nature of the SNP.
POLARIS is a valuable extension to standard PRS by adjusting for LD between markers and removing the necessity to LD prune data prior to analysis. POLARIS provides a test of the set's association with disease while also producing subject specific risk scores.

ACKNOWLEDGMENTS
We thank the MRC Centre for Neuropsychiatric Genetics and Genomics for supporting this project and the MRC for supporting author Emily Baker (MR/K013041/1) . This project was also supported by Dementia Project UK (DPUK).
Data used in the preparation of this article was obtained from the Genetic and Environmental Risk for Alzheimer's disease (GERAD1) Consortium (Harold et al., 2009).
We thank the International Genomics of Alzheimer's Project (IGAP) for providing summary results data for these analyses. The investigators within IGAP contributed to the design and implementation of IGAP and/or provided data but did not participate in analysis or writing of this report.