Structural Genomic Variation as Risk Factor for Idiopathic Recurrent Miscarriage

Recurrent miscarriage (RM) is a multifactorial disorder with acknowledged genetic heritability that affects ∼3% of couples aiming at childbirth. As copy number variants (CNVs) have been shown to contribute to reproductive disease susceptibility, we aimed to describe genome-wide profile of CNVs and identify common rearrangements modulating risk to RM. Genome-wide screening of Estonian RM patients and fertile controls identified excessive cumulative burden of CNVs (5.4 and 6.1 Mb per genome) in two RM cases possibly increasing their individual disease risk. Functional profiling of all rearranged genes within RM study group revealed significant enrichment of loci related to innate immunity and immunoregulatory pathways essential for immune tolerance at fetomaternal interface. As a major finding, we report a multicopy duplication (61.6 kb) at 5p13.3 conferring increased maternal risk to RM in Estonia and Denmark (meta-analysis, n = 309/205, odds ratio = 4.82, P = 0.012). Comparison to Estonian population-based cohort (total, n = 1000) confirmed the risk for Estonian female cases (P = 7.9 × 10−4). Datasets of four cohorts from the Database of Genomic Variants (total, n = 5,846 subjects) exhibited similar low duplication prevalence worldwide (0.7%–1.2%) compared to RM cases of this study (6.6%–7.5%). The CNV disrupts PDZD2 and GOLPH3 genes predominantly expressed in placenta and it may represent a novel risk factor for pregnancy complications.


Recurrent miscarriage cases and fertile control samples
The study subjects from Estonia were recruited at the Women's Clinic of Tartu University Hospital and Nova Vita Clinic, Tallinn, Estonia since 2003. The Danish subjects have been recruited at the Danish Recurrent Miscarriage Clinics, Copenhagen and Aalborg, Denmark since 1986. In total, the current study included 558 RM patients (80 female and 39 male partners from Estonia; 229 female and 210 male cases from Denmark) (Supp. Table S1). All recruited patients had a normal karyotype tested from peripheral blood lymphocyte cultures and known clinical risk factors of RM had been excluded. Female patients had normal menstrual cycles, no uterine anomalies (by hysterosalpingography, hydrosonography or hysteroscopy) and no antiphospholipid syndrome. The female patients were further subphenotyped as either primary (Estonia, n = 46, median age 29.5 years, 5 -95th% 20. 3 -36.8 years; Denmark, n = 113, 29.0 years, 22.0 -36.8 years) or secondary RM (Estonia, n = 34, median age 31.0 years, 5 -95th% 23.0 -37.0 years; Denmark, n = 116, 30.0 years, 23.8 -39.0 years) based on the occurrence of consecutive miscarriages either before (if any) live births or following one or more live births, respectively.
The study was conducted according to the Declaration of Helsinki principles and a written informed consent was obtained from each individual prior to recruitment and collection of blood samples for DNA extraction.

Population-based cohort samples from Estonian Biobank (EGCUT)
The carrier status of the identified RM-associated risk CNV at 5p13 was additionally determined for population-based cohort samples (n = 1000; 504 men, 496 women) drawn from the Estonian Biobank. The cohort (18 years of age and up) closely reflects the age, sex and geographical distribution of the Estonian population. Studies on the genetic structure of Europe have placed Estonia in the proximity of Northern and Eastern European populations (Nelis et al., 2009). As the EGCUT study protocol incorporates information on self-reported experience of spontaneous miscarriages of the recruited subjects and does not include confirmed medical diagnosis of recurrent miscarriage disease, the self-reported data was not addressed in this study. The entire project is conducted according to the Estonian Gene Research Act and all of the participants have signed the broad informed consent.

Discovery phase: genome-wide SNP genotyping and CNV detection
In the discovery phase, proportionally one-third of Estonian sample (n = 70; fertile controls, n = 27; RM cases, n = 43; Supp. Table S2) was genotyped using Illumina Human370CNV-Quad SNP array (Illumina) (Genotyping Core Facility, Estonian Biocentre) with SNP call rate >99.4% for all samples (median 99.8%). For each sample, calling of CNVs from the resulting genome-wide genotyping data was performed in parallel with two Hidden Markov Model-based algorithms. Normalized signal intensity data was analysed with QuantiSNP program (Colella et al., 2007) calling CNVs independently for each individual. The adjustment for 'genomic waves' in signal intensities (Diskin et al., 2008) was turned on with the '--doGCcorrect' key and '--emiters 25' together with '--Lsetting 1000000' were added to the default calling parameters. A parallel analysis was performed with the same genotyping data applying PennCNV algorithm (Wang et al., 2007) and using an Estonian population specific B-allele frequency file (PFB file, calculated from 1000 EGCUT samples) as a reference dataset (Nelis et al., 2009). The default parameters and adjustment of genomic waves in signal intensities was applied as in QuantiSNP analysis. The initial CNV calls from the two algorithms were merged and only CNVs that were called by both algorithms for the same individual in the same genomic loci were considered in the subsequent analysis. As demonstrated previously, the advantage of using the intersection of CNV predictions by more than one algorithm effectively minimizes the number of false positive calls (Winchester et al., 2009;Pinto et al., 2011;Kim et al., 2012). CNVs with QuantiSNP log Bayes Factor (LBF value) <5 and/or rearrangements shorter than 250 bp were excluded from the resulting list of CNVs. For the EGCUT population-based cohort samples the microarray data was processed in a similar manner using parallel CNV calling by QuantiSNP and PennCNV.

Conversion of genomic coordinates
Following the CNV detection in the discovery phase and descriptive statistics of identified CNVs and CNVRs, we transferred all CNVR breakpoint coordinates from the reference genome assembly (NCBI36/hg18) to the latest version of human reference sequence (GRCh37/hg19) in order to acquire the up-to-date genomic annotation data. Conversion of coordinates was performed with the liftOver web-based batch conversion tool (http://genome.ucsc.edu/cgi-bin/hgLiftOver) and using default parameters. For 15 regions in total, the conversion was not possible, as these regions were duplicated, deleted or split in the newer version of the human reference genome assembly.

Functional enrichment analysis
The list of genes disrupted by CNVRs was acquired for the sample set of RM patients as well as for fertile controls using Ensembl PERL API to query the Ensembl database (version 69; http://www.ensembl.org/index.html). The query was extended by 10 kb on either side of the CNVRs in order to account for possible effects on the regulatory regions within immediate proximity of the involved genes (Pinto et al., 2010). Functional enrichment analysis of subsequent gene sets was carried out separately for fertile controls and all RM patients using g:Profiler gGOSt web-based software (http://biit.cs.ut.ee/gprofiler/) (Reimand et al., 2007;Reimand et al., 2011). Considering all genes in user-provided gene lists or chromosomal regions, g:Profiler performs statistical gene set enrichment analysis to find which functional groups and/or biological pathways are significantly overrepresented among user-provided genes, often helping with biological interpretation of high-throughput experiments. Results of Gene Ontology (GO) and Reactome (REAC) datasets with up to third relative hierarchy level were taken into account and enrichment for functional terms was considered significant if the multiple testing corrected enrichment P-value was <0.05. A reciprocal functional enrichment analysis in either RM cases or fertile controls was undertaken for functional terms specifically enriched in one of the study groups to determine the group-specific level of enrichment.

Experimental copy number estimation of prioritized CNVRs using TaqMan qPCR
In order to define the DNA region within the prioritized CNVRs most informative for experimental copy number estimation and placement of TaqMan qPCR assays, the genomic range of each selected rearrangement was narrowed down by identifying the minimal region overlapping between the CNV carriers based on the SNP array CNV calling data. For experimental testing, seven of the prioritized CNVRs (sized 294 bp to 49.6 kb; median 10.4 kb) were targeted with one TaqMan qPCR assay, whereas two largest CNV loci, a 52.4 kb duplication at chromosomal position 5p13.3 and 84.1 kb deletion at 14q32.33, were addressed with two assays in parallel (Supp.  Biosystems). Half of the samples on each plate were controls and half were RM patients in random order. Each plate included a population specific DNA pool of randomly selected fertile controls (n = 50) from either Estonia or Denmark. The copy number of a target region was determined using relative quantification method by normalization to the reference RNase P and the population-specific DNA pool. The diploid genomic copy number was calculated by multiplying the normalized TaqMan qPCR copy number estimates by two. Due to limitations of TaqMan qPCR assay to accurately determine very high diploid copy numbers, individuals with estimated locus copy number larger than four were assigned into copy number class '>4 copies per diploid genome'.

Experimental confirmation of the genomic position and range of 5p13.3 CNV
The confirmation of the 5p13.3 duplication endpoints estimated by SNP array was performed with four EvaGreen qPCR assays (EvaGr assays 1-4; Supp. Table S6; Supp. Figure S1) flanking the predicted breakpoints. Singleplex amplification reactions were run using 0.2 -0.4 µM primers, HOT FIREPol ® EvaGreen ® qPCR Mix Plus (ROX) (Solis Biodyne) as per manufacturer's instructions and 10 ng of genomic DNA of samples with known 5p13.3 copy number based on the data of TaqMan qPCR copy number typing. Each copy number class (2, 3 or 4 copies per diploid genome) was represented by three individuals, except for the >4 copies/diploid genome identified in only one carrier. A reference gene albumin (ALB; MIM# 103600) was amplified in parallel with target loci for every sample under the same conditions and on the same plate (primers in Supp. Table S6). All reactions were run in triplicate and detected on ABI Prism 7900HT Sequence Detection System (Applied Biosystems). Copy number was estimated using absolute quantification method and normalized to the reference gene ALB and population-specific pool of control DNAs.

Fine-mapping of the 5p13.3 rearrangement
The exact position of duplication breakpoint junction in three 5p13.3 duplication carriers was determined using DNA sequencing by primer walking and targeting breakpoint junction region (5.8 kb) defined based on EvaGreen mapping. The sequencing was performed using ABI Prism dGTP BigDye Terminator v3.0 Ready Reaction Cycle Sequencing Kit (Applied Biosystems) (junction-specific primers provided in Supp.

Expression profile analysis of PDZD2 and GOLPH3 genes
The expression analysis was performed using human tissue cDNA panels Human MTC panel I (Lot no 7080213) and Human MTC panel II (Lot no 8030311A) (BD Biosciences Clontech, Palo Alto, CA). The Human MTC panel I consists of human cDNA samples from brain (pool of n = 2 samples), heart (n = 3), kidney (n = 5), liver (n = 3), lung (n = 4), pancreas (n = 15), placenta (n = 8) and skeletal muscle (n = 7). Human MTC panel II is compiled of human cDNAs from colon (n = 5), ovary (n = 15), peripheral blood leucocytes (information not available), prostate (n = 98), small intestine (n = 32), spleen (n = 3), testis (n = 45) and thymus (n = 9). The expression profile of the GOLPH3 and PDZD2 genes was determined with TaqMan qPCR approach using pre-designed TaqMan Gene Expression Assays (Applied Biosystems). The total volume of the duplex amplification reactions was 10 µl and included 1 µl of cDNA, 0.5 µl of HPRT (MIM# 308000) TaqMan assay used as a reference, 0.5 µl of a target-specific TaqMan assay (GOLPH3, Hs00223239_m1; PDZD2, Hs01054836_m1) and HOT FIREPol Probe qPCR Mix Plus (ROX) (Solis Biodyne) as per manufacturer's instructions. All reactions were run in triplicate and detected using ABI Prism 7900HT Sequence Detection System (Applied Biosystems). Relative expression of target genes was determined by normalization to the reference transcript of HPRT.

Copy number assignment of TaqMan qPCR values
The copy number assignment for simple deletion and/or duplication polymorphisms at 2p11.2 and 4q25 CNVRs was performed manually due to clear clustering of TaqMan qPCR values to distinct copy number classes (Supp. Figure S5). For the multicopy 5p13.3 locus, average copy number ratio of two TaqMan assays located within the rearranged region was used for the assignment of each sample into a distinct copy number cluster with k-means clustering method in the statistical package R (ver. 2.15.0; http://www.R-project.org/). The number of clusters to be used was inferred from the number of peaks in the averaged copy number distribution data (Supp. Figure S2). Three clusters (representing diploid copy numbers 2, 3 and 4) were used for the analysis of Estonian and four clusters (diploid copy numbers 2, 3, 4 and >4) for Danish samples.
(A) Genomic context of 5p13.3 involving PDZD2 and GOLPH3 genes based on UCSC database (hg19). The opposite transcription of the PDZD2 and GOLPH3 genes is indicated with blue and green arrows, respectively. DGV Struc Var, structural variation data from the Database of Genomic Variants. (B) Mapping of duplication endpoints using TaqMan qPCR and EvaGreen qPCR assays. Yellow and red arrowed lines indicate the alternative duplication regions predicted by SNP array. Blue and green boxes in the schematic representation of genes denote the coding regions of PDZD2 and GOLPH3, respectively, with exon numbers given above. Locations of qPCR assays are indicated with black arrows. Altogether, four EvaGreen qPCR assays (EvaGr assays 1 -4) and one TaqMan qPCR assay in the PDZD2 gene (Hs02781158_cn; Supp. Table S5) bordering with the predicted duplication breakpoints were used for experimental fine-mapping. Copy number values are given as mean ± SD of three representative individuals in each copy number class (2, 3 or 4 copies/diploid genome), except for copy number class of >4 copies with only one individual available for testing.
Supp. Figure S2. Distribution of unrounded copy number estimations for PDZD2:GOLPH3 duplication based on TaqMan qPCR in Estonian and Danish RM cases (n = 119 and n = 439, respectively) and fertile controls (n = 90 and n = 115, respectively). All study subjects were further tested with the duplication junction-specific PCR (BP-PCR) to confirm the presence (indicated with green bars) or absence (black bars) of the tandem duplication. Based on this double estimation, all subjects were divided into subgroups of duplication carriers or noncarriers subsequently used in association testing. Separation of duplication carriers into distinct copy number classes (3, 4 or >4 copies per genome) is indicated (see also Supp. Materials and Methods). Due to limitations of TaqMan qPCR assay to accurately determine very high diploid copy numbers, individual with estimated locus copy number larger than four was assigned into copy number class '>4 copies per diploid genome'.
Supp. Figure S3. Cumulative length of all CNVs per individual among the discovery phase Estonian RM cases (n = 43) and fertile controls (n = 27). The boxes represent the 25th and 75th percentiles of the data, whereas the whiskers cover the extent of the data of up to 1.5x interquartile range. The median value is denoted as the line bisecting the boxes. Study subjects with outlier values are indicated with circles. RM, recurrent miscarriage; F, female; M, male.
Supp. Figure  Supp. Figure S5. Comparative copy number estimations for prioritized CNV regions predicted based on SNP microarray genotyping data and precisely determined with TaqMan qPCR in the Estonian discovery sample (n = 70). SNP microarray prediction data is given according to QuantiSNP CNV calling results due to higher accuracy in copy number estimation compared to PennCNV in our dataset. Each CNV region was targeted with one TaqMan assay, except for the largest CNVR in the group, 5p13.3 (52.4 kb, predicted based on SNP array data), targeted with two assays within the locus (Supp . Table S5). Additional CNV carriers detected with TaqMan qPCR but unidentified by SNP microarray-based CNV calling are highlighted in red. TaqMan qPCR copy number values are unrounded and given per diploid genome.
Supp. Figure S6. Previously reported CNVs and segmental duplications for the complex prioritized CNV regions not carried on to the genetic association study in the full Estonian sample set (Supp. Table S4). The CNV data is based on the Database of Genomic Variants (DGV) as presented in the UCSC browser (http://genome.ucsc.edu/cgi-bin/hgGateway).
Chromosomal positions are given according to hg19. Red CNV tracks represent deletions, blue duplications, brown indicate both deletions and duplications in size relative to the reference and purple tracks denote inversions.
Supp. Figure (DGV). Presence in the DGV was considered as confirmation of the prioritized CNVR as true CNV locus. g Although confirmed as copy number variable by TaqMan qPCR, the CNVR was excluded from further study due to inconsistent copy number data likely due to complex genomic context of the rearrangement (Supp. Figure S6). h CNV regions characterized by both deletion or duplication events in studied subjects. i Two TaqMan qPCR assays were applied for the quantification of the two largest regions among the prioritized CNVRs.