Regions of homozygosity as risk factors for multiple myeloma

Abstract Genomic regions of homozygosity (ROH), detectable in outbred populations, have been implicated as determinants of inherited risk. To examine whether ROH is associated with risk of multiple myeloma (MM), we performed whole‐genome homozygosity analysis using single‐nucleotide polymorphism genotype data from 2,282 MM cases and 5,197 controls, with replication in an additional 878 MM cases and 7,083 controls. Globally, the distribution of ROH between cases and controls was not significantly different. However, one ROH at chromosome 9q21, harboring the B‐cell transcription factor gene KLF9, showed evidence of a consistent association and may therefore warrant further investigation as a candidate risk factor for MM. Overall, our analysis provides little support for a homozygosity signature being a significant factor in MM risk.


INTRODUCTION
Multiple myeloma (MM) is a malignancy of plasma cells that has a significant heritable component as evidenced by the two-to fourfold increased risk shown in relatives of MM patients (Altieri, Chen, Bermejo, Castro, & Hemminki, 2006;Palumbo & Anderson, 2011).
Recent genome-wide association studies (GWAS) have provided the first risk alleles for MM identifying common (minor allele frequency [MAF] >5%) single-nucleotide polymorphisms (SNPs) at 17 independent loci (Broderick et al., 2011;Chubb et al., 2013;Mitchell et al., 2016;Swaminathan et al., 2015;Went & Sud, 2018). Although these SNPs individually have modest effects on risk (odds ratio [OR] typically <1.5), their identification has yielded novel insights into MM biology, implicating disruption of B-cell biology and chromatin remodeling pathways in development of the disease (Mitchell et al., 2016).
Whereas much of the heritable risk of MM is likely to be enshrined in such polygenic susceptibility, statistical modeling indicates that a significant fraction of the familial risk will be unexplained (Mitchell et al., 2015). A number of possible explanations for this missing heritability have been proposed including rare and recessive alleles (Manolio et al., 2009;Mitchell et al., 2015). Common regions of homozygosity (ROH), the result of autozygosity (i.e., occurrence of two alleles at the same locus originating from a common ancestor as a consequence of nonrandom mating), have been documented to be detectable at high frequency in outbred populations as a consequence of selective pressure (Ku, Naidoo, Teo, & Pawitan, 2011;Lencz et al., 2007). Searching for ROH on a genome-wide basis therefore affords a strategy for discovery of recessively acting disease genes and has been exploited in studies of rheumatoid arthritis (Yang, Chang, Liang, Lin, & Wang, 2012), Alzheimer's (Ghani et al., 2015), and early-onset Parkinson's disease (Simon-Sanchez et al., 2012).
Findings from these studies support the hypothesis that recessive, disease-predisposing loci not readily detected using a conventional GWAS approach exist (Yang et al., 2012). A possible reason for this apparent failure of conventional GWAS is that the relative risks per locus are too modest and/or that the disease-associated variants are not in strong linkage disequilibrium (LD) with the tag SNPs.
Although GWAS may therefore have limited ability to identify recessive disease alleles through SNP analyses, such data sets can potentially be exploited in the search for recessively acting disease loci through whole-genome homozygosity analysis (WGHA). To explore this prospect in MM, we implemented WGHA using SNP genotype data from two GWAS data sets totaling 3,160 cases and 12,280 controls.

Discovery GWAS
For the discovery analysis, we analyzed genotype data from a previously described (Broderick et al., 2011) GWAS, which comprised 2,282 MM cases ascertained through the UK Medical Research Council (MRC) Myeloma-IX trial (Morgan et al., 2010). Genotype data on 5,197 individuals from the UK Welcome Trust Case-Control Consortium 2 (WTCCC2) study of individuals from the 1958 British Birth Cohort (58C) (also known as the National Child Development Study) and the UK Blood Service collections served as controls. Cases were genotyped using Illumina Human OmniExpress-12 (Version 1.0 array; Illumina, San Diego, CA) and controls using Illumina Human 1-2M-Duo Custom (Version 1 array chips).

Replication GWAS
For replication, we utilized genotype data generated on an additional series 878 MM cases ascertained through the UK MRC Myeloma-IX and Myeloma-XI trials. Controls consisted of (1) 2,976 cancer-free men recruited by the PRAC-TICAL Consortium-the UK Genetic Prostate Cancer Study (age <65 years), a study conducted through the Royal Marsden NHS Foundation Trust and SEARCH (Study of Epidemiology & Risk Factors in Cancer Heredity), recruited general practices in East Anglia (2003)(2004)(2005)(2006)(2007)(2008)(2009), and (2) 4,446 cancerfree women across the United Kingdom, recruited via the Breast Cancer Association Consortium. Both cases and controls were genotyped using Illumina OncoArray.

Ethics
Collection of patient samples and associated medical information was undertaken with ethical review board approval, specifically, for the Myeloma-IX trial, by the MRC Leukaemia Data Monitoring and Ethics Committee (MREC 02/8/95, ISRCTN68454111), and for the Myeloma-XI trial, by the Oxfordshire Research Ethics Committee (MREC 17/09/09, ISRCTN49407852). The diagnosis of MM (ICD-10 C90.0) was established in accordance with World Health Organization guidelines.

Quality control
Both GWAS have previously been subject to stringent quality control (Broderick et al., 2011;Chubb et al., 2013;Mitchell et al., 2016;Swaminathan et al., 2015). Briefly, samples with a low SNP call rate (<95%) were excluded, as were SNPs where <95% of DNA samples generated a genotype at the locus. We excluded individuals of non-European ancestry (using the HapMap Version 2 CEU, JPT/CHB, and YRI populations as reference). For apparent first-degree relative pairs, we excluded the control from a case-control pair; otherwise, we excluded the individual with the lower call rate. SNPs with a MAF <1% or displaying significant deviation from Hardy-Weinberg equilibrium (p < 10 −5 ) were excluded. Genotypes were imputed to >10 million SNPs using IMPUTE2 (Version 2.3) (Howie, Donnelly, & Marchini, 2009) software in conjunction with a merged reference panel consisting of data from the 1000 Genomes Project (Abecasis et al., 2010) (phase 1 integrated release March 3, 2012) and UK10K, retaining only alleles with an INFO score of >0.8.

Identification of ROH
PLINK (Version 1.9) (Purcell et al., 2007) was used to search for ROH with a specified length, inputted in the homozyg-SNP parameter, of 73 for the discovery data set and 100 for the replication data set. This resulted in an ROH length that was more than an order of magnitude larger than the mean haploblock size in the human genome without being so large as to be very rare. The likelihood of observing, in the case of the discovery data set, 73 consecutive chance events can be calculated on the basis that mean heterozygosity in the controls was 35%. Given 408,422 SNPs and 7,478 individuals, a minimum length of 57 would be required to produce <5% randomly generated ROH across all subjects (i.e., (1-0.35) 57 × 408,422 × 7,478 = 0.047). A consequence of LD is that the SNP genotypes are not always independent. Analysis based on PLINK's pairwise LD SNP pruning function revealed 311,773 separable tag groups, representing a 24% reduction of information compared to the original number of SNPs. Thus, ROH of length 73 were used to approximate the degrees of freedom of 57 independent SNP calls.
Following this, an ROH of ≥73 SNPs in length were pruned to only those that occurred in more than 10 individuals. To ensure that a minimum length and number of SNPs in each ROH was maintained, each individual's SNP data were recoded as one if the SNP was in an ROH for that individual F I G U R E 1 Cumulative distributions of regions of homozygosity (ROH) in cases and controls of (a) discovery, and (b) replication data sets.
Each data point represents the cumulative fraction of the samples with the corresponding minimum run of homozygosity and zero otherwise. Each SNP present in fewer than 10 individuals were then recoded to zero before any ROH that were subsequently less than the required number of SNPs in length were removed. This resulted in a list of "common" ROH with a minimum of 73 consecutive ROH calls across 10 or more samples. To investigate association of ROH at a locus, we considered those ROHs with p-value <0.01 in the discovery data set.

Statistics and bioinformatics
We detected ROH using PLINK (Version 1.9), which views a sliding window of SNPs across the genome. We examined broad-sense ROH, which considers a homozygosity-rich region containing a small proportion of heterozygotes, the result of genotyping errors, artificial heterozygosity, or mutations. Specifically, our analysis allowed for 2% heterozygotes in each window to prevent an underestimation of the number and size of ROH. The remaining options were set to default values (including allowing five missing calls per window), with the exception of the homozyg-SNP parameter that was calculated for each data set according to the aforementioned method for defining the ROH length. Statistical analyses were performed using R (Version 3.4) and data was formatted using Perl code. A χ 2 test was used to compare the distribution of categorical variables between cases and controls. To compare the difference in average number of ROH between cases and controls, we used a Student's t-test. Meta-analyses were performed in R (Version 3.3.1) using the fixed-effects inversevariance method.
To provide evidence of positive selection at each locus, we obtained iHS, D, and F st metrics from dbPSHP (Li et al., 2014). iHS measures the extent of LD surrounding a locus, with an extreme absolute value being indicative of a positively selected allele (Voight, Kudaravalli, Wen, & Pritchard, 2006). Tajima's D metric infers selection by quantifying the difference between number of segregating sites and average number of nucleotide differences, with a large difference being indicative of selection (Tajima, 1989). F st assesses the proportion of genetic diversity as a result of allele frequency differences within a sample population versus the entire population (Holsinger & Weir, 2009).

RESULTS
After imposing stringent quality control to GWAS data sets, the discovery GWAS provided genotype data on 2,282 cases and 5,197 controls, and the replication GWAS genotype data on 878 cases and 7,083 controls (Supplementary Tables 1  and 2).
Using PLINK, we found a total of 393 ROH in the discovery data set (Supplementary Table 3, Supplementary  Figure 1). The mean number of ROH was lower in cases than in controls (p = 0.016), although the mean total length of ROH present was not significantly different (cases 232 Mb, controls 230 Mb; p = 0.15; Table 1). The cumulative length of ROH in cases and controls was computed to examine the cumulative distribution in each (Figure 1). ROH encompassing the centromere were seen on chromosomes 1, 2, 3, 4, 5, 6,7,8,10,11,12,16,17,18,19, and 20, including a previously described ROH (Lencz et al., 2007) that bounded the centromere on chromosome 8 and hence served as a control. The longest ROH (44.2 Mb) was the one spanning the centromeric region of chromosome 1. Six regions showed a difference in frequency between cases and controls at a p < 0.01 level of significance (Table 2).
We sought validation of these findings in the replication GWAS data set. Overall, 362 ROH were found in the replication data set (Supplementary Table 4). As was the case with the discovery GWAS data set, ROH that encompassed the centromere were identified on chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 16, 17, 18, 19, and 20, with the longest ROH (46.4 Mb) spanning the centromere of chromosome 1. The average length of ROH in the replication data set (5.65 Mb), was comparable with that found in the discovery (5.07 Mb) series. In contrast to the discovery series, the average number of ROH in the replication data set was not significantly different between cases and controls (Table 1). Furthermore, the mean total length of ROH was not significantly different (cases: 355 Mb, controls: 360 Mb; p = 0.12; Table 1). The cumulative distribution provided a demonstration of the T A B L E 2 Results of regions of homozygosity (ROH) associated with multiple myeloma in the discovery data set (p < 0.01) and replication data set Note. iHS, D, and F st metrics were obtained from dbPSHP (Li et al., 2014). P Meta is calculated using an inverse-variance weighted method. comparability in distribution of ROH between cases and controls ( Figure 1). Of the six candidate ROH linked to MM risk in the discovery GWAS, two ROH located at 9q21 and 10q22 showed a consistent direction of effect, with 9q21 having a comparable frequency ( Table 2). The combined p-values for the consensus ROH regions at 9q21 and 10q22 (Figure 2), were 0.01 and 0.18, respectively (Table 2). Although the 9q21 association is not significant after adjusting for multiple testing, such correction can result in a type II error. Support for positive selection of this region is provided by the associated high D and iHS metrics (Table 2). Intriguingly, the consensus region for the ROH at 9q21, contains the gene encoding the transcription factor KLF9. Decreased levels of KLF9 have been implicated in a hyperproliferative B-cell response and defective Ig production (Good & Tangye, 2007;Savignac et al., 2010) and a predictive marker of sensitivity to bortezomib treatment in MM (Mannava et al., 2012;Ri, 2016).

DISCUSSION
Recent studies have suggested that analysis of genomic ROH may provide evidence of cancer risk (Assie, Laframboise, Platzer, & Eng, 2008;Thomsen et al., 2016aThomsen et al., , 2016bWang et al., 2013). Specifically, that interrogation of locus-specific ROH can provide a strategy to identify recessive diseasecausing variants. Generally, the strategy has been applied to populations characterized by an increased degree of inbreeding (Feldman, Lee, & Seligman, 1976), although there is evidence that the success in such populations may be attributed to confounding factors (Abdellaoui et al., 2013(Abdellaoui et al., , 2015Ceballos, Joshi, Clark, Ramsay, & Wilson, 2018;Mitchell et al., 2016). More recently, the strategy has been applied in outbred populations; however, few of the reported analyses have sought replication of study findings. It is notable that in a recent WGHA of schizophrenia risk, attempts to validate previous findings have failed despite a well-powered replication analysis (Mitchell et al., 2016).
Here, we have sought to mitigate against confounding factors or the winner's curse (Palmer & Pe'er, 2017) by investigating the homozygosity signature of SNP genotype data first in a discovery and then in an independent replication data set, totaling 3,160 MM cases and 12,280 population-matched controls. A further strength of our analysis is the stringent quality control applied to each of the data sets; notably, both had no demonstrable evidence of population stratification. Finally, to search for meaningful ROH and avoid inflation because of correlated SNPs, we adjusted computed ROH length to account for the fraction of SNPs that were in tagged regions.
Our discovery WGHA and replication provided no evidence of an association between global or individual ROH and MM risk. While not statistically significant after correction for multiple testing, an ROH at 9q21 showed a consistent direction of effect. The consensus region at 9p21 implicates the transcription factor KLF9, which, when given F I G U R E 2 Continued a plausible biological role in MM pathogenesis, may warrant further investigation.
In conclusion, our analysis provides little support for a homozygosity signature being a significant factor in MM risk. However, we cannot exclude the possibility of recessive alleles influencing MM risk.

ACKNOWLEDGMENTS
In the United Kingdom, Myeloma UK and Bloodwise provided principal funding. Additional funding was provided by Cancer Research UK (C1298/A8362 supported by the Bobby Moore Fund) and The Rosetrees Trust. M.W. is supported by funding from Mr. Ralph Stockwell. A.S. is supported by a clinical fellowship from Cancer Research UK and the Royal Marsden Haematology Research Fund. This study made use of genotyping data on the 1958 Birth Cohort generated by the Wellcome Trust Sanger Institute (http://www.wtccc.org.uk). We also thank the staff of the CTRU University of Leeds and the NCRI Haematology Clinical Studies Group. We thank the High-Throughput Genomics Group at the Wellcome Trust Centre for Human Genetics (funded by Wellcome Trust Grant reference 090532/Z/09/Z) for the generation of UK myeloma Oncoarray data. We are indebted to the patients and their clinicians who participated in the sample and data collection.

CONFLICTS OF INTEREST
The authors declare no conflict of interest.