A data‐driven investigation of relationships between bipolar psychotic symptoms and schizophrenia genome‐wide significant genetic loci

The etiologies of bipolar disorder (BD) and schizophrenia include a large number of common risk alleles, many of which are shared across the disorders. BD is clinically heterogeneous and it has been postulated that the pattern of symptoms is in part determined by the particular risk alleles carried, and in particular, that risk alleles that also confer liability to schizophrenia influence psychotic symptoms in those with BD. To investigate links between psychotic symptoms in BD and schizophrenia risk alleles, we employed a data-driven approach in a genotyped and deeply phenotyped sample of subjects with BD. We used sparse canonical correlation analysis (sCCA) (Witten, Tibshirani, & Hastie, 2009) to analyze 30 psychotic symptoms, assessed with the OPerational CRITeria checklist, and 82 independent genome-wide significant single nucleotide polymorphisms (SNPs) identified by the Schizophrenia Working Group of the Psychiatric Genomics Consortium for which we had data in our BD sample (3,903 subjects). As a secondary analysis, we applied sCCA to larger groups of SNPs, and also to groups of symptoms defined according to a published factor analysis of schizophrenia. sCCA based on individual psychotic symptoms revealed a significant association (p = .033), with the largest weights attributed to a variant on chromosome 3 (rs11411529; chr3:180594593, build 37) and to delusions of influence, bizarre behavior, and grandiose delusions. sCCA using the same set of SNPs supported association between the same SNP and the group of symptoms defined as "factor 3" (p = .012). A significant association with the "factor 3" phenotype group was also observed when we included a greater number of SNPs that were less stringently associated with schizophrenia; although other SNPs contributed to the significant multivariate association result, the greatest weight remained assigned to rs11411529.
Our results suggest that canonical correlation analysis is a useful tool to explore phenotype-genotype relationships. To the best of our knowledge, this is the first study to apply this approach to complex, polygenic psychiatric traits. The sparse canonical correlation approach offers the potential to include a larger number of fine-grained systematic descriptors, and to include genetic markers associated with other disorders that are genetically correlated with BD.


| INTRODUCTION
Bipolar disorder (BD) is a severe, often recurrent, mental illness, associated with disability, suicide, and a reduction in life expectancy of over 10 years (Vos et al., 2015). Pervasive high mood and increased energy are core features of the disorder, characteristically alternating with spells of depression and normal mood states (Vázquez, Holtzman, Lolich, Ketter, & Baldessarini, 2015). BD is clinically heterogeneous; psychotic symptoms are present in some individuals but not others, and when these occur, they can be indistinguishable from those present in people with schizophrenia (Craddock, O'Donovan, & Owen, 2005; Grande, Berk, Birmaher, & Vieta, 2016).
Molecular and epidemiological studies have reported strong evidence of shared genetic etiology between BD and schizophrenia (Andreassen et al., 2013; Cardno and Owen, 2014; Cardno, Rijsdijk, Sham, Murray, & McGuffin, 2002; Lichtenstein et al., 2009; Purcell et al., 2009; Sullivan, Daly, & O'Donovan, 2012). It is now established that common genetic variants contribute liability to both disorders, and in addition, the fraction of heritability conferred by such variants to schizophrenia and BD is substantially (around 68%) correlated (Lee et al., 2013).
A number of studies have aimed to identify characteristics of the BD phenotype that are most strongly linked to schizophrenia risk, and have generally done so by testing predefined subgroups of BD patients against total burden of schizophrenia risk alleles. Such studies have shown that in people with BD, the burden of alleles identified in studies of schizophrenia is highest in those with psychotic symptoms (Allardyce et al., 2017; Goes et al., 2012), while conversely, in people with schizophrenia, the burden of alleles identified in studies of BD is highest in people with manic symptoms (Ruderfer et al., 2014). While studies of total risk burden are providing insights into the relationships between schizophrenia and BD, a limitation of this approach is that alleles identified from studies of one disorder are considered to act uniformly on a particular symptom, or set of symptoms, in people with the other disorder. Given that both schizophrenia and BD are highly heterogeneous disorders, if genetic heterogeneity underpins phenotypic heterogeneity, such universal genotype-phenotype relationships are unlikely to apply.
An alternative is to use data-driven approaches to seek novel relationships between phenotypic variables and genotypes. However, such analyses are challenging in the context of high-dimensional data, which comprise large numbers of associated alleles, even larger numbers of combinations of alleles, and potentially thousands of phenotypic data points and phenotypic combinations (Ehrenreich & Nave, 2014). Here, we have begun to address this problem using canonical correlation analysis (CCA) (Hotelling, 1936), an approach designed to identify linear relationships (usually hidden) between two sets of multidimensional variables. We exploit the more recently developed sparse CCA (sCCA) (Witten et al., 2009), which addresses the high computational burden of CCA for high-dimensional data by minimizing the number of features used in both phenotypic variables and genotypes while maximizing the correlation between the two sets.
The broad hypothesis underpinning our study is that schizophrenia liability is not randomly distributed in individuals with BD; rather liability is enriched among people with BD who manifest particular clinical features. We have previously shown that en masse, schizophrenia liability is linked to psychotic symptoms in BD (Allardyce et al., 2017).
Here, we aim to extend that finding by investigating, in a purely data-driven manner, the possibility that further granularity exists in the relationship between schizophrenia liability and BD; specifically, whether particular schizophrenia risk alleles (or groups of alleles) show evidence for relatively selective effects on particular psychotic features in people with BD. By way of comparison, we also undertook a phenotypic hypothesis-based CCA, based on grouping symptoms according to a three-factor classification of schizophrenia symptoms (Cardno et al., 1996).

| Diagnostic assessments
Information was collected by interviewing participants with the Schedules for Clinical Assessment in Neuropsychiatry (Wing et al., 1990). Psychiatric and general practice case notes were also reviewed.
Interview and case note data were combined. Participants were diagnosed using DSM-IV criteria, including 2,628 cases with BD-I, 1,089 cases with BD-II, 124 cases with Schizoaffective BD, and 66 cases with BD NOS. Fifty-three percent of patients had psychotic features. Clinical ratings were made according to the OPCRIT (OPerational CRITeria) checklist (McGuffin, Farmer, & Harvey, 1991). Originally designed to facilitate a polydiagnostic approach to psychotic and mood disorders for molecular genetic research, OPCRIT includes items on psychopathology and history. For the current analyses we used items concerning psychotic symptoms, rated on a lifetime-ever basis (summarized in Table 1). Team members involved in the interview, rating, and diagnostic procedures were all fully trained research psychologists or psychiatrists.

| Quality control for OPCRIT data
Information on OPCRIT measurements was available for 4,589 BD subjects of European ancestry. The OPCRIT items most frequently rated as present also had higher missing value rates (>10%). For the CCA, we excluded subjects who had three or more missing values among the OPCRIT items, retaining 3,903 subjects.

TABLE 1 Description of OPCRIT items. Columns present missingness and presence of the OPCRIT items in the full sample (N = 4,589) and cleaned sample (N = 3,903), respectively. The last two columns present two different ways of lumping OPCRIT items into three groups (groups defined by schizophrenia factor analysis; groups defined using a phenomenological approach).
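
The subject-level exclusion rule for OPCRIT data (three or more missing items) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `filter_subjects`, the use of `NaN` to mark a missing rating, and the toy data are all assumptions.

```python
import numpy as np

def filter_subjects(items: np.ndarray, max_missing: int = 2) -> np.ndarray:
    """Return a boolean mask of subjects (rows) to keep.

    items: subjects x OPCRIT-items array, with np.nan marking a missing rating
    (an assumed encoding). A subject is kept when at most `max_missing` items
    are missing, i.e. excluded when three or more items are missing.
    """
    n_missing = np.isnan(items).sum(axis=1)
    return n_missing <= max_missing

# Hypothetical toy data: 4 subjects rated on 5 items (1 = present, 0 = absent).
toy = np.array([
    [0, 1, 0, 0, 1],                          # no missing items
    [np.nan, 1, 0, np.nan, 0],                # two missing items
    [np.nan, np.nan, np.nan, 0, 1],           # three missing items
    [1, np.nan, np.nan, np.nan, np.nan],      # four missing items
])
keep = filter_subjects(toy)
cleaned = toy[keep]
```

Only the first two toy subjects survive the filter; the threshold is exposed as a parameter because the text notes that no single missingness cutoff suits all datasets.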
by principal components (PCs) analysis in a joint analysis of 2,000 subjects from 19 different populations taken from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015).
To remove SNPs affected by genotyping platform differences, SNP frequencies were compared in each pair of cohorts (with logistic regression), and SNPs were removed if their frequencies were significantly different (p < .01). Our final data set contains 3,211,519 imputed SNPs. Ten PCs were generated and used to control for population stratification (see Figure 1 for first two PCs). The total number of individuals after QC was 3,903.

| SNP selection
Imputed genotypes were clumped for linkage disequilibrium (LD; window 1,000 kb, r² = .2) using PLINK2 (Chang et al., 2015), retaining the SNPs most significantly associated with schizophrenia (The Psychiatric Genomics Consortium, 2014). We excluded the MHC region because of its complex long-range LD properties (Price et al., 2008) and discarded schizophrenia-associated SNPs for which variation in imputation across arrays resulted in low-quality data. We retained 82 LD-independent SNPs for further analysis (Supporting Information).

CCA finds two sets of basis vectors, one for each set of variables, such that the correlations between the projections of the variables onto these basis vectors are mutually maximized (Hotelling, 1936). The dimensionality of these new bases is equal to, or less than, the smaller dimensionality of the two sets of variables, in our case, the smaller of the number of phenotypic variables and the number of SNPs.
Formally, the CCA concept can be described as follows. Consider $n$ subjects with two sets of multidimensional measurements, phenotypes ($p$ measured phenotypes) and genotypes ($q$ SNPs). Then $X$ is an $n \times p$ matrix of phenotypes and $Y$ is an $n \times q$ matrix of genotypes.

Extension of CCA to sCCA makes the technique more suitable for analyzing large correlated datasets (e.g., when $p + q$ exceeds $n$). sCCA aims to find a "sparse" solution, that is, projections that depend on a small number of variables, making the analysis more robust and powerful (Witten et al., 2009).
Similar to ordinary CCA, sCCA searches for canonical variates $u$ and $v$ that maximize the correlation $\mathrm{cor}(Xu, Yv)$, subject to additional convex penalty constraints $P_1(u) \le c_1$ and $P_2(v) \le c_2$. The parameters $c_1$ and $c_2$ determine the numbers of variables in $X$ and $Y$ that have nonzero weight.
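
For reference, one standard way to write this optimization is the penalized formulation of Witten et al. (2009). The display below is a hedged restatement rather than necessarily the exact form used in this study: it assumes that the columns of $X$ and $Y$ are centered and scaled, and it uses lasso ($L_1$) penalties as an illustrative choice for $P_1$ and $P_2$.

$$
\max_{u,\,v} \; u^{\top} X^{\top} Y v
\quad \text{subject to} \quad
\|u\|_2^2 \le 1, \;\; \|v\|_2^2 \le 1, \;\; P_1(u) \le c_1, \;\; P_2(v) \le c_2,
\qquad \text{e.g. } P_1(u) = \|u\|_1, \; P_2(v) = \|v\|_1 .
$$

Under this choice, smaller bounds $c_1$ and $c_2$ force more entries of $u$ and $v$ to be exactly zero, which is how the bounds control the number of phenotypes and SNPs with nonzero weight.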
To understand how well the sCCA captures the relationship between the two matrices, p-values are usually computed using a permutation approach. In brief, computation of sCCA is performed in three stages. First, sCCA is run with a permutation option where the best parameters $c_1$ and $c_2$ are chosen based upon p-values. Second, sCCA is run with the best parameters $c_1$ and $c_2$ from stage 1.
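
The core of this procedure (a sparse rank-one fit followed by a permutation p-value) can be sketched in NumPy as below. This is an illustrative sketch, not the software used in the study: it assumes lasso penalties and standardized columns, takes $c_1$ and $c_2$ as fixed inputs rather than tuning them by permutation, uses a plain empirical p-value, and all function names and the simulated data are hypothetical.

```python
import numpy as np

def soft_threshold(a, delta):
    """Shrink each entry of `a` toward zero by `delta`."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def l1_bounded_unit(a, c, n_bisect=30):
    """Unit-L2-norm vector proportional to soft_threshold(a, delta), with delta
    chosen by bisection so that its L1 norm is at most c (assumes c >= 1)."""
    s = a
    if np.abs(a).sum() > c * np.linalg.norm(a):
        lo, hi = 0.0, np.abs(a).max()
        for _ in range(n_bisect):
            mid = 0.5 * (lo + hi)
            t = soft_threshold(a, mid)
            if np.abs(t).sum() > c * np.linalg.norm(t):
                lo = mid  # still too dense: shrink harder
            else:
                hi = mid  # constraint met: try shrinking less
        s = soft_threshold(a, hi)
    norm = np.linalg.norm(s)
    return s / norm if norm > 0 else s

def standardize(m):
    """Center and scale columns (assumes no constant columns)."""
    return (m - m.mean(axis=0)) / m.std(axis=0)

def sparse_cca(X, Y, c1, c2, n_iter=20):
    """Rank-one sparse CCA by alternating updates; returns (u, v, correlation)."""
    X, Y = standardize(X), standardize(Y)
    K = X.T @ Y
    v = np.linalg.svd(K, full_matrices=False)[2][0]  # leading right singular vector
    for _ in range(n_iter):
        u = l1_bounded_unit(K @ v, c1)
        v = l1_bounded_unit(K.T @ u, c2)
    return u, v, np.corrcoef(X @ u, Y @ v)[0, 1]

def permutation_p_value(X, Y, c1, c2, n_perm=100, seed=0):
    """Empirical p-value: permute the rows of X to break the X-Y link."""
    rng = np.random.default_rng(seed)
    observed = sparse_cca(X, Y, c1, c2)[2]
    null = [sparse_cca(X[rng.permutation(len(X))], Y, c1, c2)[2]
            for _ in range(n_perm)]
    return (1 + sum(r >= observed for r in null)) / (1 + n_perm)

# Simulated toy data with one planted link: column 0 of X drives column 2 of Y.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
Y = rng.normal(size=(200, 8))
Y[:, 2] += 0.8 * X[:, 0]
u, v, corr = sparse_cca(X, Y, c1=1.2, c2=1.2)
p_value = permutation_p_value(X, Y, c1=1.2, c2=1.2)
```

In the toy data the largest weights fall on the planted pair of variables, mirroring how the nonzero weights are read in the analyses reported here. The study itself reports p-values from 1,000 permutations; the sketch uses 100 only to keep the example fast.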

| Imputation of phenotypes and genotypes
The proportion of missing data directly affects the quality of (s)CCA analyses as individuals with missing values are usually removed and as a consequence, power is decreased. There is no established cutoff for missing value thresholds suitable for all datasets; some suggest that 5%-10% missingness is acceptable, depending on the patterns of the missing data (Tabachnick & Fidell, 2007), others have applied a 15%-20% threshold for genomic data (Lin et al., 2013), and 20% for clinical data (Bennett, 2001).
In our data, after QC, the missing value rate does not exceed 11% for phenotypes and 5% for genotypes (see Table 1).

| OPCRIT groups of symptoms
OPCRIT items were combined to generate three groups of symptoms: "factor 1," "factor 2," and "factor 3" (Table 1), as suggested by schizophrenia factor analysis (Cardno et al., 1996). In this way, the sCCA may be more powerful since the number of phenotype variables (30) is reduced, there are no missing values, the ambiguity of the imputation is minimized, and the frequencies of symptom "presence" in each group are increased, as compared to individual OPCRIT item frequencies. Groups were coded as 0/1, where 1 represents that at least one OPCRIT symptom is present in a group. The numbers of people with a symptom present in each group were 1,655, 261, and 1,385 for the "factor 1," "factor 2," and "factor 3" groups, respectively. Note that an individual can be assigned to more than one group, and 1,895 subjects did not belong to any of the groups above. The correlation structure of the three dimensions is depicted in Figure 2.
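
The 0/1 group coding described above can be sketched as follows. The item-to-group assignment used here is purely hypothetical (the real assignment is given in Table 1), and treating a missing rating as "not present" when forming a group is an assumption of this sketch.

```python
import numpy as np

def code_symptom_groups(items, groups):
    """Code each group as 1 if at least one of its items is present, else 0.

    items: subjects x items array of 0/1 ratings (np.nan = missing, ignored).
    groups: dict mapping a group name to the column indices of its items.
    """
    return {name: (np.nansum(items[:, cols], axis=1) > 0).astype(int)
            for name, cols in groups.items()}

# Hypothetical toy data: 4 subjects, 4 items, and an illustrative grouping.
toy_items = np.array([
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, np.nan, 0, 1],
])
toy_groups = {"factor 1": [0, 1], "factor 2": [2], "factor 3": [3]}
coded = code_symptom_groups(toy_items, toy_groups)

# Subjects with no symptom present in any group.
in_no_group = sum(coded.values()) == 0
```

As in the text, a subject can belong to more than one group (the second toy subject is in both "factor 2" and "factor 3"), and some subjects belong to none.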
To investigate our results further, we also explored a three-cluster model based on a phenomenological approach, grouping the items into three clusters (see Table 1): "cluster 1" (including positive symptoms, present in 1,964 participants), "cluster 2" (including negative symptoms, present in 247 participants), and "cluster 3" (including disorganized symptoms, present in 480 participants).

| RESULTS
Table 2. Phenotypes and genotypes with nonzero weights chosen by sCCA are shown in columns "phenotypes" and "SNPs," respectively. Weights can be interpreted as unstandardized regression coefficients and can be negative or positive (Supporting Information Table 3).
The analysis of grouped OPCRIT items revealed a significant association between the "factor 3" group and the same SNP, rs11411529.
As shown in Table 1, the "factor 3" group includes both grandiose delusions and bizarre behavior. A post hoc within-case logistic regression analysis confirmed association between the "factor 3" group and rs11411529 (p = 9.1 × 10⁻⁵, OR = 0.79). The direction of the association was such that the schizophrenia (SZ) risk allele was associated with membership of this group, and is in agreement with the direction identified by sCCA.
As a further test, we applied sCCA to a randomly chosen half of the sample. The results were similar: rs11411529 and the "factor 3" group were identified as significantly correlated (p = .036), a finding that replicated in the second (independent) half of the sample using logistic regression (p = .03; OR = 0.83).
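
The split-half logic (a random discovery half and an independent replication half, each summarized by a logistic regression odds ratio) can be illustrated with simulated data. This is a sketch under stated assumptions: the data are simulated with an arbitrary true odds ratio, the model has a single SNP dosage predictor with no covariates, and `fit_logistic` is a hypothetical helper rather than the software used in the study.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Fit y ~ intercept + x by Newton-Raphson; returns [intercept, slope]."""
    design = np.column_stack([np.ones(len(x)), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-design @ beta))
        weights = prob * (1.0 - prob)
        gradient = design.T @ (y - prob)
        hessian = design.T @ (design * weights[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

# Simulated data (hypothetical): allele dosage 0/1/2 and group membership 0/1.
rng = np.random.default_rng(42)
n = 4000
dosage = rng.binomial(2, 0.3, size=n).astype(float)
true_or = 0.6  # arbitrary per-allele odds ratio used only for this simulation
logit = -0.3 + np.log(true_or) * dosage
in_group = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

# Random split into two non-overlapping halves.
order = rng.permutation(n)
discovery, replication = order[: n // 2], order[n // 2:]

# The odds ratio per allele is the exponential of the fitted slope.
or_discovery = np.exp(fit_logistic(dosage[discovery], in_group[discovery])[1])
or_replication = np.exp(fit_logistic(dosage[replication], in_group[replication])[1])
```

Because the two halves share no subjects, agreement in the direction and approximate size of the two odds ratios is evidence that the association is not an artifact of fitting and testing on the same data.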
As an exploratory analysis, we performed sCCA using sets of SNPs that are expected to be enriched for true associations to schizophrenia, but for which the evidence for association does not meet the definition of genome-wide significance (Supporting Information Table 4). The "factor 3" group was consistently identified as the only group that correlated with schizophrenia risk alleles, and rs11411529 remained the main contributor to that association (Supporting Information Table 5).
When genome-wide significant (GWS) SNPs were excluded, sCCA found no significant canonical correlations.

FIGURE 2 Correlation matrix between OPCRIT groups of symptoms defined by schizophrenia factor analysis (see Table 1): "factor 1," "factor 2," and "factor 3"

We then tested the association between genotype and three clusters of symptoms, defined using a phenomenological approach.
sCCA detected a borderline significant association (p = .052) between "cluster 1" (which included both delusions of influence and grandiose delusions) and the same SNP, rs11411529. sCCA did not identify a significant association when using SNPs selected at less stringent schizophrenia-association p-value thresholds (see Supporting Information).

| DISCUSSION
BD and schizophrenia are distinct categorical entities according to current diagnostic systems. Nevertheless, the two disorders share many clinical features; for example, up to 50% of patients with BD present with symptoms that are common in schizophrenia, such as persecutory delusions, auditory hallucinations, experiences of influence, and catatonic symptoms (Pope & Lipinski, 1978), and it is now clear that their genetic etiologies also substantially overlap. The relationships, if any, between the genetic and clinical overlaps are unclear, although recent studies suggest schizophrenia risk is particularly elevated in people with BD and mood-incongruent psychotic features (Allardyce et al., 2017; Goes et al., 2012).
Seeking to identify novel genotype-phenotype links, we have applied sCCA to a well-phenotyped and genomically informative sample.
sCCA is a data-driven approach that can estimate the strength of the relationships between two sets of variables (in our example, genotypes and phenotypes); in doing so, sCCA has the potential to identify novel genotype-phenotype links without investigators imposing highly specific hypotheses. To our knowledge, this is the first study of its type.
The primary finding of the hypothesis-free analysis was that a cluster of symptoms comprising the most common delusions in our sample (grandiose delusions and delusions of influence), as well as bizarre behavior, is particularly associated with a schizophrenia risk allele. Note that this association is a "within case" association, and this allele has not been reported as GWS associated with BD in any case-control analysis. The association was primarily driven by a single variant, rs11411529, which tags a locus spanning three genes: CCDC39, DNAJC19, and FXR1. It is as yet unclear which, if any, of these three genes is involved in schizophrenia susceptibility.
A second analysis, in which we imposed a structure on the BD phenotype based upon factor analysis of symptoms in schizophrenia, identified the same allele as associated with the "factor 3" group. Being constrained, the latter analysis does not fully exploit the potential of sCCA, but its reduced dimensionality enhances power, allowing us to detect associations once again between the "factor 3" group and a larger number of SNPs selected using more relaxed significance criteria. It should be noted that sets of SNPs at those sub-GWS thresholds are nevertheless enriched for true associations; indeed, among the eight SNPs at a threshold of p = 10⁻⁵ from the PGC (The Psychiatric Genomics Consortium, 2014) that together show significant evidence for association with disorganized features, five map to loci that are GWS in a larger recent schizophrenia GWAS dataset (Pardiñas, 2018); in addition to rs11411529, these include rs999494 (EMX1), rs75968099 (TRANK1), rs6803008 (FOXP1), and rs5004844 (CNTN4). The TRANK1 locus has previously been reported to be significant in a case-control study of BD (Chen et al., 2013), and the index SNP rs75968099 is also significant in the GWAS (The Psychiatric Genomics Consortium, 2014) from which we selected alleles to be tested in this study. We speculate that the inclusion of this SNP as contributing to a multivariate association involving relaxed significance thresholds, but not the more stringent GWS threshold, possibly indicates joint association with other SNPs. We denote the above loci by gene name, but as for rs11411529, the functional basis for the associations is not understood. Further studies to confirm these associations are needed, and if confirmed, their biological functions may potentially offer a route into understanding heterogeneity of BD.
We tested the validity of sCCA to identify genotype-phenotype relationships by applying it to a random draw of half of the sample.
Our finding of association between rs11411529 and the "factor 3" group in the discovery half of the sample was independently replicated by a different analytic method (logistic regression) in the second (independent) half of the sample, supporting the hypothesis that sCCA can identify true associations in complex datasets, although at present, in genomics terms, the findings are modest and need to be replicated.

TABLE 2 sCCA results for GWS schizophrenia SNPs and two types of phenotypes used in the analysis (individual OPCRIT items and OPCRIT groups of symptoms). "Correlation" and "p-value" columns give the best sCCA correlation coefficient and corresponding p-value obtained by 1,000 permutations. Columns "phenotypes chosen by sCCA" and "SNPs chosen by sCCA" show phenotypes and SNPs with nonzero weights chosen by the analysis.

Phenotypes used in the analysis | Correlation | p-value | Phenotypes chosen by sCCA | SNPs chosen by sCCA
OPCRIT groups defined by schizophrenia factor analysis | 0.063 | 0.012 | "factor 3" group | rs11411529

p-Values are obtained by 1,000 permutations.

Using a phenomenological approach to group OPCRIT items, we also replicated the association between the cluster of symptoms containing grandiose delusions and delusions of influence and SNP rs11411529, confirming that this association is driven mainly by these two OPCRIT items.
Strengths of this study are the use of a validated assessment tool and the assessment of inter-rater variability (Di Florio et al., 2013); the largest sample size to date with this level of granularity of phenotypic information; and phenotypic data obtained from multiple sources, including case notes. Limitations are reliance on retrospective assessment of psychosis, the low prevalence of some psychotic symptoms, and missingness. In addition, the sCCA approach may not be the most powerful when genotype-phenotype relationships are nonlinear, and our sample size, while large for this type of study, is still small in the genomics context.
In summary, we show that the sCCA approach is capable of revealing relationships between complex phenotype and genotype data, and provide evidence for associations between sets of SNPs and features of the bipolar phenotype. Given sample size limitations, the specific associations are best regarded as hypothesis generating, and require evaluation in other well-phenotyped samples.