Detecting rare copy number variants from Illumina genotyping arrays with the CamCNV pipeline: Segmentation of z‐scores improves detection and reliability

The intensities from genotyping array data can be used to detect copy number variants (CNVs), but a high level of noise in the data and overlap between the intensity distributions of different copy numbers produce unreliable calls, particularly when only a few probes are covered by the CNV. We present a novel pipeline (CamCNV) with a series of steps to reduce noise and reliably detect CNVs covering as few as three probes. The pipeline aims to detect rare CNVs (below 1% frequency) for association tests in large cohorts. The method uses the information from all samples to convert intensities to z-scores, thus adjusting for variance between probes. We tested the sensitivity of our pipeline by looking for known CNVs from the 1000 Genomes Project in our genotyping of 1000 Genomes samples. We also compared the CNV calls for 1661 pairs of genotyped replicate samples. At the chosen mean z-score cut-off, sensitivity to detect the 1000 Genomes CNVs was approximately 85% for deletions and 65% for duplications. From the replicates, we estimate that the false discovery rate is controlled at ∼10% for deletions (falling to below 3% with more than five probes) and ∼28% for duplications. The pipeline demonstrates improved sensitivity when compared to calling with PennCNV, particularly for short deletions covering only a few probes. For each called CNV, the mean z-score is a useful metric for controlling the false discovery rate.

Genotyping experiments with large case-control studies have been effective at identifying single-nucleotide polymorphisms (SNPs) associated with disease. Associations with common copy number variants (CNVs) can also be detected by imputation of their genotypes using reference datasets with sequence data. The 1000 Genomes Project reported that 73% of structural variants with greater than 1% frequency are in strong linkage disequilibrium (r² > 0.6) with nearby SNPs, although the proportion falls to 44% for common duplications (Sudmant et al., 2015). For example, at two genome-wide association study loci associated with breast cancer, the APOBEC locus (Michailidou et al., 2017; Nik-Zainal et al., 2014) and 2q35 (Wyszynski et al., 2016), imputed deletions tagged by SNPs are associated with risk, and the CNV has been identified as the probable causal variant in each case. However, with current imputation reference panel sizes, it is not possible to reliably impute rare CNVs. With increasing sample sizes genotyped on arrays, there is an opportunity to detect novel associations with rare CNVs that potentially have large effects on disease risk.
Disease associations for rare CNVs detected using genotyping arrays have typically involved large CNVs, often covering multiple genes, for example, those associated with developmental disorders (Jacquemont et al., 2011) and schizophrenia (Rees et al., 2014). Evidence from clinical tests using gene panels suggests that intragenic CNVs account for a significant proportion of pathogenic variants, between 4.7% and 35% depending on the clinical area (Truty et al., 2018). There is also evidence that intronic CNVs can affect the regulation of genes where the exons are highly conserved across evolution (Rigau et al., 2019). Such CNVs are likely to cover only a small number of probes on genotyping arrays.
Several methods have been developed for calling CNVs from the intensities of genotyping arrays. The most commonly used method is PennCNV (Wang et al., 2007), which implements a hidden Markov model to find increases or decreases in the intensities of probes, ordered by genomic position, for a single sample. Other methods such as DNAcopy (Venkatraman & Olshen, 2007) use circular binary segmentation (CBS) to detect these shifts in intensity for each sample. Methods such as CNVTools (Barnes et al., 2008) and cnvHap (Coin et al., 2010) model the intensities from all samples at CNV loci but are designed to analyse more common CNVs.
The pipeline (CamCNV; Figure 1) we describe here uses DNAcopy but also aims to incorporate the information from all samples. It works on a similar principle to the MeZOD method (Kirov et al., 2011; McCarthy et al., 2009), detecting rare outliers in the distribution of intensities. The method excludes probes in common CNV regions and assumes that, at each probe tested, the vast majority of samples will have two copies of DNA and their intensities will form a normal distribution. Single-copy deletions or three-copy duplications will form part of separate distributions that overlap with the two-copy distribution. Converting the intensity measurements to z-scores and segmenting with DNAcopy detects runs of probes that are potential CNVs for each sample. Our method draws on previous work (Cooper et al., 2015) to remove noise and batch effects from the intensities with a principal component adjustment (PCA). We provide estimates of the sensitivity using known CNVs in 1000 Genomes samples and of the false discovery rate (FDR) from a large number of replicates.

| Principal component adjustment
We first export the log R ratio (LRR) intensities for each sample at each probe from Illumina's GenomeStudio software into a matrix ordered by probe position. To reduce batch effects in the intensity measurements, we follow the advice of Cooper et al. (2015) and perform a PCA on the LRR. This should also adjust for genomic waves (Diskin et al., 2008). We exclude from the matrix SNPs that fail genotyping quality control (QC) and use the "thin" function from the R bigpca package (Cooper) to select the 15% of SNPs most representative of the principal components. We use the function "quick.elbow" to decide how many principal components lie above the "elbow" of the PCA plot and adjust the entire LRR matrix, for all SNPs, for those PCs using the bigpca "PC.correct" function. Figures S1-S4 show an example principal component scree plot and the variance of the LRR before and after PCA.

| z-Score calculation
Using all samples, we calculate the mean and standard deviation of the LRR at each probe and hence convert each LRR to a z-score as:

z_ij = (LRR_ij − μ_j) / σ_j

where μ_j and σ_j are the mean and standard deviation of the LRR at probe j across all samples, and i indexes samples.
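The per-probe standardisation can be sketched in Python (the pipeline itself is implemented in R; `lrr_to_zscores` is a hypothetical helper operating on a samples × probes matrix):

```python
import numpy as np

def lrr_to_zscores(lrr):
    """Convert an LRR matrix (samples x probes) to z-scores,
    standardising each probe using the mean and SD across all samples."""
    mu = lrr.mean(axis=0)             # per-probe mean across samples
    sigma = lrr.std(axis=0, ddof=1)   # per-probe SD across samples
    return (lrr - mu) / sigma

# Toy example: 4 samples x 3 probes
lrr = np.array([[0.0, 0.1, -0.2],
                [0.1, 0.0, -0.1],
                [-0.1, 0.2, 0.0],
                [0.0, 0.1, -0.1]])
z = lrr_to_zscores(lrr)
```

After this step each probe (column) has mean zero and unit standard deviation, so runs of unusually low or high values within a sample stand out regardless of the probe's original intensity scale.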

| Probe and loci exclusions
Before calling CNVs, we exclude probes likely to produce unreliable calls. To exclude probes likely to represent technical failures at the chip genotyping level, we exclude those not clustered by the Illumina GenTrain algorithm (GenTrain score < 0.15) and low-intensity probes (Illumina metric AB R Mean < 0.2). As our pipeline is designed to detect rare CNVs, we exclude probes within CNVs reported by the 1000 Genomes Project as having greater than 1% frequency. We observed some false-positive deletion calls at probes within a few base pairs of another probe, presumably because competition for DNA fragments lowers the intensity. To guard against this, we exclude probes within three base pairs of another probe. Finally, we exclude noisier probes with high LRR variance, choosing a threshold of 0.12, two SDs above the mean variance for all probes. This should exclude probes that fail genotyping in a large enough subset of samples to skew the z-scores for all samples. In total, we exclude ∼10% of probes (Table S1). After CNV calling, we exclude CNVs falling within 1 Mb of the telomeres and centromeres as defined in the "gap" database table from the UCSC Genome Browser. We also exclude CNVs within immune-related loci (major histocompatibility complex, T-cell receptor, and immunoglobulin heavy chain), as CNVs at these loci may occur somatically in blood.
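The variance-based exclusion can be illustrated with a short Python sketch (a hypothetical helper; the 0.12 threshold quoted in the text was derived in this way, as two SDs above the mean per-probe variance):

```python
import numpy as np

def high_variance_probes(lrr):
    """Flag probes whose LRR variance lies more than two SDs above
    the mean variance across all probes."""
    v = lrr.var(axis=0, ddof=1)               # per-probe variance
    threshold = v.mean() + 2 * v.std(ddof=1)  # e.g. ~0.12 for OncoArray
    return v > threshold

# Simulated example: 500 samples x 20 probes, one noisy probe
rng = np.random.default_rng(0)
lrr = rng.normal(0, 0.1, size=(500, 20))
lrr[:, 7] = rng.normal(0, 1.0, size=500)  # probe 7 is much noisier
mask = high_variance_probes(lrr)
```

Probes flagged by `mask` would be dropped before z-score calculation, since a probe failing in even a subset of samples inflates the probe's SD and shrinks everyone's z-scores.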

| CBS of z-scores
The z-scores are saved into an R data frame for each chromosome, ordered by genomic position, with the excluded probes removed. We use the R package DNAcopy (Olshen et al., 2004; Venkatraman & Olshen, 2007) to segment the z-scores for each sample individually by chromosome using CBS. After using the "smooth" function to detect outliers and smooth the data, we call the segments using the "segment" function. The R commands used for sample s in z-score data frame z_df are:

CNA.object <- CNA(cbind(z_df[s]), z_df$chr, z_df$positions,
                  data.type = "logratio",
                  sampleid = colnames(z_df[s]), presorted = TRUE)
smoothed <- smooth.CNA(CNA.object)
segments <- segment(smoothed)
This produces a list of segments for each sample for each chromosome, with the start and end positions, the number of probes, and the segmentation mean for each segment. Before deciding the final cut-off values, we label as potential CNVs any segments covering at least three probes with a mean z-score below −2 or above +1.5. Rare larger CNVs may bias our comparisons, so we exclude segments above an arbitrary upper length limit of 3 Mb or 200 probes.
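The labelling step can be sketched in Python (field names are illustrative; DNAcopy's actual output uses columns such as num.mark and seg.mean):

```python
def label_potential_cnvs(segments,
                         del_cut=-2.0, dup_cut=1.5,
                         min_probes=3, max_probes=200, max_bp=3_000_000):
    """Keep segments that look like potential CNVs: mean z-score below
    the deletion cut-off or above the duplication cut-off, covering at
    least three probes and no more than 200 probes / 3 Mb."""
    calls = []
    for s in segments:
        n, mean_z = s["num_probes"], s["seg_mean"]
        if n < min_probes or n > max_probes or s["end"] - s["start"] > max_bp:
            continue
        if mean_z <= del_cut:
            calls.append({**s, "type": "deletion"})
        elif mean_z >= dup_cut:
            calls.append({**s, "type": "duplication"})
    return calls

segs = [
    {"start": 1_000, "end": 5_000, "num_probes": 4, "seg_mean": -3.1},
    {"start": 6_000, "end": 9_000, "num_probes": 10, "seg_mean": 0.1},
    {"start": 9_500, "end": 9_600, "num_probes": 2, "seg_mean": -4.0},
    {"start": 10_000, "end": 60_000, "num_probes": 8, "seg_mean": 2.2},
]
calls = label_potential_cnvs(segs)  # keeps the first and last segments
```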

| LRR shift scores per sample
As an additional check, we calculate the shift in the original LRR (before the PCA) between the baseline for the sample and the mean LRR across the called CNV. For each sample, we calculate a mean LRR figure for each chromosome. To ensure that this measure is based on normal two-copy intensities, we only use common SNPs with a high call rate (>99%) where the sample appears heterozygous (B allele frequency [BAF] around 0.5). We then calculate the difference between this mean sample LRR and the mean of the LRR observed for that sample across the probes in the called CNV. From the CNVs that match well-imputed CNVs, we observe that greater than 99% of the sample shift scores lie within the range of −0.8 to −0.2 for deletions and +0.1 to +0.4 for duplications. We therefore exclude any called CNVs outside these ranges.
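A Python sketch of the shift check, assuming the per-chromosome baseline has already been computed from high-call-rate heterozygous SNPs (function and variable names are illustrative):

```python
import numpy as np

# Plausible ranges from the text: >99% of genuine calls had shift
# scores in [-0.8, -0.2] for deletions and [0.1, 0.4] for duplications.
SHIFT_RANGES = {"deletion": (-0.8, -0.2), "duplication": (0.1, 0.4)}

def lrr_shift_ok(cnv_type, baseline_lrr, cnv_probe_lrr):
    """Compare the mean raw LRR across the called CNV's probes with the
    sample's chromosome baseline; reject calls whose shift falls
    outside the expected range for the CNV type."""
    shift = np.mean(cnv_probe_lrr) - baseline_lrr
    lo, hi = SHIFT_RANGES[cnv_type]
    return lo <= shift <= hi
```

For example, a called deletion whose probes average 0.45 below the baseline passes, while one sitting essentially at the baseline is rejected as a likely false positive.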
The BAF can be used alongside the LRR to identify CNVs. For samples with normal two-copy DNA at a probe, this score should be near zero for homozygotes for the reference allele (AA), 0.5 for heterozygotes (AB), and one for homozygotes for the alternate allele (BB). For deletions, the BAF has some value as a negative predictor; any values not close to zero (A−) or one (B−) suggest that it is not a true deletion. Thus, for each potential deletion we calculate, as an exclusion score, the percentage of probes with BAF greater than 0.3 and less than 0.7. We exclude deletions in which ≥25% of probes fail this check.
For three-copy duplications, scores should be around zero (AAA), one (BBB), 0.33 (AAB), or 0.66 (ABB), so we can record a positive predictor score as the percentage of probes with BAF greater than 0.1 and less than 0.4, or greater than 0.6 and less than 0.9. However, for short CNVs the value of the BAF is limited, as we do not necessarily expect the CNV to cover probes that might be polymorphic, particularly on arrays targeting rare variants.
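The two BAF scores can be sketched as follows (hypothetical helpers, with the thresholds given in the text):

```python
def baf_deletion_fail_fraction(bafs):
    """Fraction of probes with BAF between 0.3 and 0.7. True one-copy
    deletions should have BAFs near 0 or 1, so intermediate values count
    against the call; the pipeline excludes deletions where >=25% of
    probes fail this check."""
    fails = [b for b in bafs if 0.3 < b < 0.7]
    return len(fails) / len(bafs)

def baf_duplication_support(bafs):
    """Fraction of probes with BAF consistent with three copies
    (AAB ~0.33 or ABB ~0.66): between 0.1 and 0.4, or 0.6 and 0.9."""
    hits = [b for b in bafs if 0.1 < b < 0.4 or 0.6 < b < 0.9]
    return len(hits) / len(bafs)
```

A deletion with BAFs [0.01, 0.99, 0.5, 0.02] has a fail fraction of 0.25 and would be excluded under the ≥25% rule; homozygous probes (BAF near 0 or 1) contribute nothing either way to the duplication score, which is why the BAF is less informative for short CNVs.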

| Breast Cancer Association Consortium samples
To develop the pipeline, we used data from 57,796 breast cancer cases and 48,209 controls of European ancestry from 54 studies (Table S2) in the Breast Cancer Association Consortium (BCAC) dataset. These were genotyped on the Illumina OncoArray chip (533,631 probes) at five genotyping centres (Amos et al., 2017). In addition, we included 1661 pairs of duplicate samples. Although the majority of duplicates were genotyped intentionally by studies for QC, 160 of the pairs were submitted by different studies and later identified as the same person from their genotypes and date of birth. The PCA described above was performed separately on each study. For eight studies, the number of samples was too large to adjust as a single batch and these studies were divided into two or three batches.

| 1000 Genomes Project controls
As part of the OncoArray genotyping QC, samples from the 1000 Genomes Project were added to each plate at the genotyping centres. In addition, 240 samples were genotyped by Illumina. After sample QC, we used OncoArray data for 233 individuals comprising 2226 replicates. A total of 169 individuals were genotyped once and 64 were genotyped more than once (between 2 and 349 times). We downloaded the structural variant calls from Phase 3 of the 1000 Genomes Project (Sudmant et al., 2015), extracted deletions and duplications for the 233 comparison individuals, and calculated the start and end probes for any variant covered by OncoArray probes. There are some complex variants (e.g., one individual has a series of 60 deletions at the end of chromosome 18). To simplify the tests, we excluded sets of CNVs where a sample had five or more CNVs starting in the same megabase, and also used an arbitrary cut-off of 50 probes to exclude very long variants.

| Sample QC
For the BCAC dataset, we included only DNA samples extracted from whole blood, to minimise variation in sample quality. 1000 Genomes DNA samples were from lymphoblastoid cell lines. Samples with genotype call rates below 99%, excess heterozygosity, or sex chromosome anomalies were excluded (Michailidou et al., 2017). We calculate a derivative log ratio spread (DLRS) figure for each sample as the SD of the differences between the LRRs of probes ordered by genomic position, divided by the square root of two. This measures the variance from each probe to the next, averaged over the whole genome, and so is insensitive to fluctuations in intensities between different chromosomes for individual samples (Cooper et al., 2015). We excluded samples with DLRS more than 3.5 SDs above the DLRS mean for the study. This exclusion was applied both before and after the PCA. We used the remaining samples to generate the z-scores and call CNVs.
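The DLRS calculation is straightforward; a Python sketch:

```python
import numpy as np

def dlrs(lrr_ordered):
    """Derivative log ratio spread: SD of probe-to-probe LRR differences
    (probes in genomic order) divided by sqrt(2). For independent probe
    noise this recovers the per-probe SD while being insensitive to
    broad intensity shifts between chromosomes."""
    diffs = np.diff(np.asarray(lrr_ordered, dtype=float))
    return diffs.std(ddof=1) / np.sqrt(2)

# Sanity check: for i.i.d. noise with SD 0.2, DLRS should be ~0.2
rng = np.random.default_rng(1)
noise = rng.normal(0, 0.2, size=100_000)
```

The division by √2 compensates for the fact that the difference of two independent values with SD σ has SD σ√2.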
After CNV calling, to exclude samples that may generate multiple false calls with our method, we counted the number of short segments (between 3 and 200 probes) for each sample. We observed that the distribution of segment counts is skewed to the right, with many noisy samples having an excessive number of segments (Table S3). Based on the rationale that the true number of segments should be approximately Poisson, we applied the variance-stabilising transformation y = 2 × sqrt(x), where x is the segment count for each sample (for a Poisson count, y has an SD of approximately one), and excluded samples whose transformed count exceeded the cut-off:

cut-off = median(y) + 3.5
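On the reading that y = 2√x is a variance-stabilising transform of the segment count (so the +3.5 acts as a ∼3.5-sigma rule), the exclusion can be sketched as (hypothetical helper):

```python
import numpy as np

def segment_count_cutoff(counts):
    """Flag noisy samples using variance-stabilised segment counts:
    y = 2*sqrt(x) has SD ~1 if x is Poisson, so samples with
    y > median(y) + 3.5 lie roughly 3.5 SDs above the typical sample."""
    y = 2 * np.sqrt(np.asarray(counts, dtype=float))
    cutoff = np.median(y) + 3.5
    return cutoff, y > cutoff

# Example: 99 typical samples with 4 segments each, one noisy sample
counts = [4] * 99 + [100]
cutoff, flagged = segment_count_cutoff(counts)  # cutoff = 4.0 + 3.5 = 7.5
```

Using the median rather than the mean keeps the cut-off itself robust to the noisy samples being screened out.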

| Imputed CNVs
BCAC genotypes were phased using ShapeIT (Delaneau et al., 2013) and imputed to the 1000 Genomes Phase 3 reference panel using Impute2 (Howie et al., 2012), as previously described (Michailidou et al., 2017). We compared called CNVs for samples with imputed dosages of one, that is, those imputed to have one or three copies at the CNV locus.

| PennCNV calling
As a comparison, we also called all samples using PennCNV, with the LRR adjusted for principal components as the input. PennCNV was run with the following options:

detect_cnv.pl -test -hmm lib/hhall.hmm -pfb ./pfb_files/pfb_onco.txt -conf -gcmodel onco.gcmodel

GC model files were calculated for the OncoArray, giving the percentage of G or C bases for 500 kb either side of each probe. Population frequency of B allele (PFB) files were calculated to give the frequency of the B allele for each SNP. After calling, the PennCNV script clean_cnv.pl was used to merge neighbouring CNVs where the gap between them was less than 20% of the total length of the combined CNVs. The PennCNV script filter_cnv.pl was used to generate QC statistics for each sample. Samples with an LRR SD greater than 0.2 or an excessive number of CNVs (greater than 3.5 SDs) were excluded.

| Comparison of CNV calls
In the comparisons below, CNV calls were counted as an exact match if they covered the maximum number of OncoArray probes for the reference CNV and extended no further. We categorised calls as partial matches if at least 50% of the expected probes were covered and/or the called CNV extended further, excluding calls covering greater than five times the number of probes in the reference CNV.
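A sketch of this matching logic, representing each CNV as the set of probe indices it covers (the exact handling of edge cases in the paper may differ):

```python
def categorise_match(called_probes, reference_probes):
    """Classify a called CNV against a reference CNV, both given as sets
    of probe indices: 'exact' if the call covers exactly the reference
    probes, 'partial' if it covers at least 50% of them (possibly
    extending further, but by no more than 5x the reference size),
    otherwise 'none'."""
    overlap = called_probes & reference_probes
    if called_probes == reference_probes:
        return "exact"
    if (len(overlap) >= 0.5 * len(reference_probes)
            and len(called_probes) <= 5 * len(reference_probes)):
        return "partial"
    return "none"
```

For instance, a call covering two of three reference probes counts as a partial match, while a 20-probe call against a 3-probe reference is rejected by the 5x limit.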

| Precision of the z-score cut-offs
z-Score cut-offs are determined based on the mean and SD of values in the sample. Thus, the value corresponding to a z-score cut-off of K is given by

x_K = μ̂ + K σ̂

where μ̂ and σ̂ are the estimated mean and standard deviation. In a finite sample,

Var(μ̂) ≈ σ²/N and Var(σ̂) ≈ σ²/(2N)

where N is the sample size. Thus, on the normalised scale, the variance of the estimated cut-off is approximately

Var(K̂) ≈ (1 + K²/2)/N
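Assuming the cut-off variance is approximately (1 + K²/2)/N on the normalised scale, a quick check in Python reproduces the ±0.1 standard error quoted in the Discussion for a deletion cut-off of −3.7 with ∼800 samples:

```python
import math

def cutoff_se(K, N):
    """Approximate standard error of an estimated z-score cut-off K
    from N samples, under Var ~ (1 + K**2 / 2) / N."""
    return math.sqrt((1 + K**2 / 2) / N)

se = cutoff_se(-3.7, 800)  # ~0.099, i.e. about +/-0.1
```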

| Distribution of mean z-scores for imputed CNVs
To evaluate the distribution of mean z-scores for segments matching imputed CNVs, we compared the segments called by the pipeline (Figure 1) that matched a set of well-imputed (imputation r² > 0.8) rare CNVs in the BCAC data. We identified 26 distinct deletions and 11 duplications, carried by 7189 and 4874 samples, respectively. We plotted histograms of the mean z-scores of the segments that perfectly matched the imputed CNVs (n = 4881 deletions and n = 1051 duplications), as shown in Figure 2.

| z-Score cut-offs and estimated FDR
To select mean z-score cut-offs to control the FDR, we compared called CNVs in the 1661 pairs of BCAC replicates. We assume a CNV seen in both replicates is likely to be genuine, whereas a CNV seen in only one sample is likely to be an artefact. We observed that the mean z-scores of the unmatched calls from the first of the replicate pairs (Figure 3) are concentrated towards the normal end of the distribution. This suggests that they are mostly false-positive calls at loci with two-copy DNA, although some may be genuine CNVs at probes with a large degree of overlap with the normal distribution (hence harder to call). The relationship between the mean z-score of the segment and the proportion that is unmatched is shown in Figure 3. To provide reasonable control of the FDR without impacting the sensitivity too severely, we chose cut-offs at −3.7 for deletions and +2 for duplications; this results in an overall FDR of ∼10% for the remaining deletions and ∼30% for duplications (Figure 4). The FDR does not reduce materially for duplications with more stringent cut-offs. At the other end of the distribution, we observed a small number of extreme values for the mean z-score. These may reflect homozygous deletions or duplications with more than three copies, but could also be due to probes failing to bind at all. We therefore excluded calls with z < −14 for deletions and z > +10 for duplications. Although this could exclude some true homozygotes, the main focus is on calling rare CNVs and hence this possibility can be ignored here. Using these cut-offs, the results from the matching exercise are shown in Table 1. Based on exact or partial matches, the overall FDR was 10% for deletions and 28% for duplications. The FDR decreases with increasing numbers of probes, as shown in Table 2, particularly for deletions. We observed no significant difference between the unmatched rate for samples from the same study and those collected by different studies (n = 160 pairs), where the DNA has been extracted by different laboratories at different ages and one might expect more variability from batch effects.

| Concordance of imputed and called CNVs
The overall match rate for imputed data using our chosen cut-offs is summarised in Table 3. The match rates for these CNVs will be affected by the uneven distribution of probe coverage (Table 4). The majority of these deletions have only three probes, while most of the duplications have more than six.

| Summary of called CNVs
Using the above cut-offs, we called a mean of 2.9 deletions (SD = 1.6) and 2.5 duplications (SD = 2.0) per sample in the BCAC dataset. The sample QC checks excluded 14% of the original samples (Table S4), with the proportion ranging from 2% to 46% for individual studies. Using the interview date as a proxy to estimate how long DNA had been stored before genotyping, we observed that a longer storage time is a strong predictor (p < 2e−16, adjusting for study) that a sample will fail QC (Table S5). The proportion of CNVs excluded by the LRR shift and BAF score checks is shown in Table S6. These checks are likely to exclude a small proportion of genuine CNVs, but the higher exclusion rates for unmatched CNVs in the replicate comparisons and among samples with an excessive number of segments suggest they are effective at targeting unreliable calls. After excluding long CNVs covering more than 50 probes, the mean length was 7.3 probes (SD = 7.1) / 34 kb (SD = 71 kb) for deletions, and 10.0 probes (SD = 9.6) / 70 kb (SD = 109 kb) for duplications. A total of 14% of the probes on the chip were covered by five or more deletions and 22% of probes by more than five duplications. The probes with the most CNV calls overlap with previously published CNV regions. For example, 58% of probes in deletions and 39% of probes in duplications with a frequency greater than 0.5% overlap entries in the Database of Genomic Variants (Zarrei et al., 2015; Table S7).

| Detection of known CNVs in 1000 Genomes samples
We compared our pipeline calls for 1000 Genomes samples genotyped on the OncoArray with the CNVs published by the 1000 Genomes Project. We identified 523 rare (less than 1% frequency) deletions and 308 rare duplications that cover at least three probes on OncoArray. Because some CNVs were carried by more than one of our comparison samples and some samples were replicated multiple times in the OncoArray dataset (n = 3532 comparison CNV calls in total), we chose a random replicate to perform the comparisons. The results are shown in Table 5 and the detected CNVs in Table S8. A total of 80% of deletions and 60% of duplications were detected with either a perfect or partial match. CNVs that were not detected at all tended to be short (60% cover only three or four probes) and tended to have a smaller proportion of their length covered by the chip probes: those detected with 100% precision have a mean of 65% (SD = 22%) of base pairs covered, compared to a mean of 53% (SD = 25%) for CNVs with no matches. Around 6% of CNVs were detected but were excluded because they were outside the mean z-score cut-offs or failed other QC metrics. Where CNVs were carried by more than three replicates, deletions called in one sample tended to be called in the majority of samples, with a poorer replication rate for duplications. A total of 70% of these detected deletions (n = 63) and 40% of duplications (n = 35) had a greater than 90% replication rate (Figure S5).

| Comparison with PennCNV calling
We also called the 1000 Genomes samples and BCAC replicates using PennCNV. For the 1000 Genomes comparison, we categorised the calling of the 523 deletions and 308 duplications as either "perfect", if the CNV call matched exactly with the known genotype in more than 90% of replicates, or "imperfect", if at least some calls partially matched or the replication rate was below 90%. Figure 5 and Table S9 show the comparison between the two methods. The deletions detected by CamCNV but not PennCNV (Figure S6) tend to cover fewer probes: they cover a mean of 3.9 probes (SD = 1.5), whereas those matched by both methods cover a mean of 8.5 probes (SD = 7.4). For duplications, the figures are 9.1 probes (SD = 5.5) and 14.2 probes (SD = 10.7). We also estimated the FDR for PennCNV using the BCAC replicate samples. The CamCNV FDR was lower than PennCNV's for deletions (Table 6) but higher for short duplications. In the replicate samples, CamCNV called around 1.5 times more deletions and ∼1.4 times more duplications than PennCNV, with nearly all the additional deletions having fewer than six probes.

| DISCUSSION
There are a small number of analyses of CNVs called from large genotyping datasets. These have highlighted the importance of detecting CNVs from a small number of probes (Szatkiewicz et al., 2013) and removing artefactual calls (Cooper et al., 2015). Assessing common CNVs with multiple overlapping intensity distributions at each probe requires robust statistical methods (Barnes et al., 2008). The calling problem is simplified for rare CNVs, as nearly all samples will belong to the two-copy distribution, with a small number of outliers that are potentially one-copy deletions or three-copy duplications. The two-copy intensity distributions vary between probes, but the PCA and conversion of raw intensities to z-scores adjust for this variance. Thus our method incorporates information from all the samples at each probe; this may explain the improvement in sensitivity compared to the PennCNV method, which looks at each sample individually. From known CNVs and replicate samples, we can derive the distribution of z-scores for one-copy deletions and three-copy duplications and calibrate the mean z-score cut-offs for segments to provide an acceptable FDR. From the replicate comparison, we observe that calls with more extreme mean z-scores are unlikely to be false positives, so the mean z-score provides an additional metric for the confidence of the calling, alongside the number of probes.
A limitation is that the calling for duplications is less reliable than for deletions, as the duplication intensity distribution has a greater degree of overlap with that of normal-copy samples. There also remains the possibility that technical artefacts will cause false calls to cluster at particular probes. For rare variant association tests, a proportion of false CNV calls scattered across an array is not as detrimental as a cluster of false calls at particular probes, so chip-specific probe QC is an important element of this pipeline. Strict sample QC is also required. The proportion of samples with unreliable calls may vary greatly by study, and this method may not work with samples whose DNA was not extracted from blood. We have tested the method on another Illumina array (iCOGS), and the mean z-score cut-offs used for OncoArray appeared appropriate for that data as well. Although we use a large number of samples, smaller sample sizes are likely to produce z-score cut-offs that are reasonably stable, as they depend only on the precision of the mean and variance. For example, with a sample size of ∼800 and a deletion cut-off of −3.7, the SE of the cut-off would be around ±0.1.

| CONCLUSION
We present a CNV calling pipeline tailored to detect rare CNVs from Illumina genotyping data. For deletions, it demonstrates a high degree of sensitivity while controlling the false call rate, and is an improvement on the most commonly used method (PennCNV), particularly in detecting deletions covering only a few probes. We believe that this method will be valuable in testing for associations with rare CNVs in existing genotyping datasets.

FIGURE 2: Mean z-score of the intensity of segments perfectly matching imputed copy number variants in the Breast Cancer Association Consortium data for (a) 26 deletions (n = 4881 matching segments) and (b) 11 duplications (n = 1051 segments).
FIGURE 3: Mean z-scores of matched and unmatched deletions (a) and duplications (b) in replicate samples. CNV, copy number variant.
FIGURE 4: Estimated false discovery rate (FDR) by mean z-score of deletions (a) and duplications (b) in replicate samples.
TABLE 4: Probe coverage for CNVs in comparison with imputed data. Abbreviations: CNV, copy number variant; QC, quality control.
DENNIS ET AL.