Predictive modeling of schizophrenia from genomic data: Comparison of polygenic risk score with kernel support vector machines approach

A major controversy in psychiatric genetics is whether nonadditive genetic interaction effects contribute to the risk of highly polygenic disorders. We applied a support vector machines (SVMs) approach, which is capable of building linear and nonlinear models using kernel methods, to classify cases from controls in a large schizophrenia case–control sample of 11,853 subjects (5,554 cases and 6,299 controls) and compared its prediction accuracy with the polygenic risk score (PRS) approach. We also investigated whether SVMs are a suitable approach to detecting nonlinear genetic effects, that is, interactions. We found that PRS provided more accurate case/control classification than either linear or nonlinear SVMs, and give a tentative explanation why PRS outperforms both multivariate regression and linear kernel SVMs. In addition, we observe that nonlinear kernel SVMs showed higher classification accuracy than linear SVMs when a large number of SNPs are entered into the model. We conclude that SVMs are a potential tool for assessing the presence of interactions, prior to searching for them explicitly.

A major controversy in psychiatric genetics is whether nonadditive genetic interaction effects contribute to the risk of highly polygenic disorders. We applied a support vector machines (SVMs) approach, which is capable of building linear and nonlinear models using kernel methods, to classify cases from controls in a large schizophrenia case-control sample of 11,853 subjects (5,554 cases and 6,299 controls) and compared its prediction accuracy with the polygenic risk score (PRS) approach. We also investigated whether SVMs are a suitable approach to detecting nonlinear genetic effects, that is, interactions. We found that PRS provided more accurate case/ control classification than either linear or nonlinear SVMs, and give a tentative explanation why PRS outperforms both multivariate regression and linear kernel SVMs. In addition, we observe that nonlinear kernel SVMs showed higher classification accuracy than linear SVMs when a large number of SNPs are entered into the model. We conclude that SVMs are a potential tool for assessing the presence of interactions, prior to searching for them explicitly.

K E Y W O R D S
polygenic risk score, schizophrenia, support vector machines

| INTRODUCTION
Schizophrenia has a complex, polygenic architecture in which a large number of genetic variants spanning a wide spectrum of population frequencies contribute to disease risk (International Schizophrenia Consortium et al., 2009;Lee et al., 2012). In recent years, specific schizophrenia-associated risk variants have begun to emerge from large-scale genomic studies (Sullivan, Daly, & O'Donovan, 2012).
The standard approach to genome-wide association study (GWAS) data assumes an additive model, which, in statistical terms, is equivalent to looking for the main effects of variants contributing to disease risk. The assumption of additivity has been an extremely effective approach, but it is also pragmatic, since looking at the effects of many 100,000 s of single nucleotide polymorphisms (SNPs) would be rendered computationally expensive if all potential combinations of interactions were considered (Cordell, 2009). In addition, the excessive dimensionality of such an approach would require very severe statistical correction for multiple comparison testing (Polderman et al., 2015). Although testing for some interactions is now technically possible using graphical processing units (GPU) instead of central processing units (CPU; Hemani, Theocharidis, Wei, & Haley, 2011), extremely large sample sizes will be required to achieve sufficient power to detect small genetic interaction effect sizes, as are expected in most complex genetic traits, at the very low significance thresholds dictated by multiple testing correction.
The extent to which genetic interactions contribute to risk is unknown. One study, which explicitly modeled and tested gene expression traits for evidence of pairwise SNP interaction effects (Hemani et al., 2014), found strong evidence in favor of interactions.
Moreover, additional analyses (Hemani et al., 2014) suggested that testing for interactions between SNPs with known main effects (e.g., genome-wide significant SNPs) is unlikely to be the best strategy, as the majority of interactions involved SNPs that did not have a significant main effect on gene expression. Here, we sought to investigate whether a support vector machine learning (SVM) approach can identify the presence of interactions, without explicitly specifying interaction terms in regression models.
Our study attempts to use SNPs to distinguish schizophrenia patients from controls, where genetic differences between the groups are more subtle than between populations. for an association with schizophrenia (p ≤ .01), was AUC = 0.58 and 0.70, respectively. In the present study, we employ SVM algorithms which offer both linear and nonlinear modeling options and hence may account for pairwise and higher order SNP interactions, to explore a large schizophrenia case-control sample of 11,853 subjects (5,554 cases and 6,299 controls) for classification of SZ cases and controls. We use GWS SNPs and SNPs with suggestive evidence for association with schizophrenia.
Finding a suitable SVM model to accurately predict classification in the data involves mapping the original data points into a higher dimensional space via a kernel function. However, there is still no consensus in the field as to the optimal approaches, and there is a specific need for research to assess the advantages and limitations of machine learning algorithms when applied to genetic data (Ban, Heo, Oh, & Park, 2010). For example, a number of SVM kernel options exist; depending upon the specific choice of kernel, performance of the SVM can be further tuned by adjusting a number of parameters. We chose to investigate linear and radial basis function kernel (RBF) kernels as representatives of linear and nonlinear analyses. A number of nonlinear kernels exist and in the absence of specific knowledge that would suggest another choice, the RBF kernel makes a good default kernel to test a nonlinear model. A linear kernel has one hyperparameter C, reflecting whether the separation between the points is "strict" or "soft", i.e., how strictly misclassifications are penalized; the RBF kernel has an additional parameter γ which controls the curvature of the decision boundary that separates the regions of classification.
In a direct comparison of the results of SVM analyses with those derived from a standard additive model, we first analyzed the 125 GWS autosomal SNPs from PGC2 (Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2014) in our sample. We tested (a) whether the prediction accuracy of the (additive) PRS built upon GWS SNPs, was greater/smaller than that of SVM algorithm and (b) whether SVM can indicate potential interactions between GWS SNPs. We also explored a set of top~5,000 independent SNPs most associated with schizophrenia in the PGC2 study (Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2014), and tested those for the presence of potential interactions.

| Schizophrenia case-control data
We used the CLOZUK sample of 5,554 schizophrenia cases with treatment-resistant schizophrenia receiving the antipsychotic clozapine, and 6,299 control samples. Individuals were genotyped on different arrays (see [Hamshere et al., 2013;Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2014] for details) in two batches, called "Batch 1" (3,446 cases and 4,285 controls) and "Batch 2" (2,108 cases and 2,014 controls) hereafter. To reduce the batch effect bias in this study the data were imputed using 203,436 autosomal SNPs common to both batches. Standard quality control steps were undertaken (Hamshere et al., 2013), including INFO score threshold ≥0.9 for the SNP imputation, genotype missing rate < 2%, Minor Allele Frequency (MAF) ≥10%; Hardy-Weinberg Equilibrium (HWE) significance level p ≥ 10 −4 . In addition, the extended Major Histocompatibility Complex (MHC) region (chr6: 25-34 Mb) was removed due to the highly correlated nature of the SNPs in this region. The imputed data were converted into the most probable genotypes (probability >0.9), and Linkage Disequilibrium (LD) pruned, keeping the schizophrenia associated SNPs and removing those in LD (the window around associated SNPs 500 KB, r 2 value for the LD threshold 0.1). For SNP prioritization purposes, we used genome-wide association summary statistic data, (available at https://www.med. unc.edu/pgc/results-and-downloads) of the PGC2 study (Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2014), excluding the CLOZUK subset. As the SVM algorithms require complete data, missing genotypes were imputed using a multinomial distribution with the corresponding SNP genotypes frequencies.
A known problem when searching for gene-gene interactions in a GWAS is that it cannot be expected that SNPs with the largest main effects are also most likely to be involved in interactions. SNPs with no main effects are just as likely to be involved in an interaction. For example, the XOR and anti-diagonal disease risk models specify a nonlinear interaction in the absence of independent main effects (Dong et al., 2008).

| SVM
The data analyses were carried out in Python, using two main packages (scikit-learn and pandas) for machine learning and data proces- In order to compare the accuracy of linear and nonlinear modeling with the objective of identifying the presence of SNP x SNP interactions, the full datasets were used to build the SVM models. The data were randomly split into train/test subsets using 75%/25% proportions, then the SVM models were built and tested on a large number (100 times) of such splits for each of the different kernels to obtain distributions of accuracy scores for each model. The metric used for all of the performance assessment was the Area Under the receiver operating characteristic (ROC) curve (AUC) metric, also known (ROC) score (Metz, 1978).

| Polygenic risk score
PRS is a method to summarize the trait variance captured by a set of genetic variants. We followed the approach previously described by approach with SVMs, we used exactly the same random splits (75%/25%) as for SVMs and generated individual PRSs in the 25% of the data, using summary results (β-coefficients) from the 75% of the data. As above, this procedure was repeated 100 times.
Finally, the performances of the PRS and SVM models were compared using a t test for differences in mean AUC-ROC scores for each approach, for different data classification models (PRS, SVMs with linear, and RBF kernels) in each data set (Batch 1 and Batch 2) separately, and in the combined dataset (Batch 1 + Batch 2).

| RESULTS
@@Overall, the maximal prediction accuracy achieved by SVM was AUC = 0.60-0.66, indicating that this approach did not give a practically applicable model for prediction of schizophrenia using these data. The following results focus on the comparison of the performance of the linear and nonlinear prediction modeling.

| GWAS significant SNPs
Initially, we built and tested models using the 125 GWS SNPs in the To assess whether an increase in sample size improves the accuracy of the SVM models, we combined the two data sets and repeated the analyses. The results are shown in Figure 1. The results show that the accuracy of the PRS and SVM-RBF models did not improve in the larger dataset analysis (p = .895 and p = .2, respectively), whereas in contrast, the accuracy of SVM-Linear model did improve (p = .005).
However, the fact that the accuracy of the RBF kernel SVM did not improve with the larger sample indicates that it is unlikely that there are detectable nonlinear effects in the data (i.e., no detectable interactions between the GWAS significant SNPs). The latter result replicates the finding of an absence of interactions between these SNPs from (Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2014).

| GWAS significant and suggestive SNPs
Following a similar strategy to the above, we compared the analyses on the data for 4,998 independent top SNPs, initially in the larger Batch 1 data and then in the combined dataset (Batch 1 + Batch 2).
The accuracy of PRS and SVM-RBF models improved compared to models built upon the 125 GWS SNPs (compare Figures 1 and 2). In Batch 1 data (red boxes in Figures 1 and 2), median ROC-AUC scores

| Polygenic risk scores versus multivariate regression and linear kernel SVM
We note that in the results shown, the PRS has consistently higher accuracy than the SVM models. This may be related to the fact that PRS also outperforms multivariate regression, for the following reason. Since the SNPs contributing to the PRS are prioritized for association with the disease, the risk alleles are more common among cases for each SNP. Therefore, even if associated SNPs are pruned for LD, they appear to be correlated in a case/control sample, because they are associated with disease. To illustrate this effect, we ran 1,000 simulations generating two independent SNPs which were both associated with disease in 10,000 individuals. The strength of each SNP's association to disease was varied from OR = 1 to OR = 4 and the effect of each SNP on disease risk was independent of the genotypes present at the other. Both SNPs had a MAF of either 0.2 or 0.3. The correlation in the whole sample, and in cases and controls separately was computed for each simulation. The average of these values was found across all 1,000 simulations. Figure 3 shows that separately in cases only and controls only, the correlation coefficients are approximately zero, suggesting that SNPs are independent. However, there is a larger correlation in the combined case/control data, with correlation coefficient r increasing as the strength of association and MAF increase. Figure 4 shows the direct comparison between −log 10 (pvalues) obtained with multivariate logistic regression (model including each SNP as an independent variable) and PRS for two independent SNPs with effect sizes OR = 1.2 and MAF = 0.3 in 1,000 simulations. The p-values for PRS analyses are systematically lower than for multivariate regression. The effect is even more pronounced when the number of SNPs increases. We also compared the prediction accuracy of the linear kernel SVM and multivariate logistic regression models.
The accuracy, as expected, was very similar (results are not shown).
Since both the multivariate regression and SVM-Linear models include all SNPs as separate variables for prediction, a direct comparison of the SVM-Linear model with the multivariate regression analysis is natural. However, such comparison was not viable in our schizophrenia samples due to the limitation on the number of predictor variables in multiple linear regression modeling for accurate estimation of regression coefficients (Austin & Steyerberg, 2015), and therefore we used PRS as a proxy.

| DISCUSSION
The aim of this study was to examine the application of linear and nonlinear SVMs to identifying the presence of genetic interactions which may contribute to the risk of schizophrenia, and to compare their prediction accuracy with polygenic risk score approach. Neither of the SVM models demonstrated a better predictive accuracy than the polygenic risk score approach. However, we have shown that due to the selection of SNPs for the PRS, the PRS approach will have an advantage in terms of statistical power and prediction accuracy over multivariate analyses, including SVM with linear kernel and multivariate regression analyses. We note that the PRS score can be adjusted to remove the SNP-SNP correlation in the data, which makes the PRS approach essentially equivalent to multivariate regression . It is also possible to remove only LD between SNPs, but retain the correlation due to association by using the POLARIS method (Baker et al., 2018).
In conclusion, the RBF kernel prediction accuracy shows substantial improvements over the results from the linear kernel in real data with 4,998 SNPs. This finding indicates the potential presence of interactions between these relatively weakly associated schizophrenia genetic variants, over and above the main effects of these SNPs, whilst we found no evidence for interactions among genome-wide significant index SNPs. Based on the results of the present study, we conclude that PRS remains the better option for the purpose of classification of schizophrenia cases from controls, although its currently demonstrated accuracy is insufficient for clinical practice.

ACKNOWLEDGMENTS
The work at Cardiff University was supported by Medical Research Council (MRC) Centre (MR/L010305/1) and Program Grants (G0800509). The