Analysis of sequence data to identify potential risk variants for oral clefts in multiplex families

Abstract Background Nonsyndromic oral clefts are craniofacial malformations, which include cleft lip with or without cleft palate. The etiology for oral clefts is complex with both genetic and environmental factors contributing to risk. Previous genome‐wide association (GWAS) studies have identified multiple loci with small effects; however, many causal variants remain elusive. Methods In this study, we address this by specifically looking for rare, potentially damaging variants in family‐based data. We analyzed both whole exome sequence (WES) data and whole genome sequence (WGS) data in multiplex cleft families to identify variants shared by affected individuals. Results Here we present the results from these analyses. Our most interesting finding was from a single Syrian family, which showed enrichment of nonsynonymous and potentially damaging rare variants in two genes: CASP9 and FAT4. Conclusion Neither of these candidate genes has previously been associated with oral clefts and, if confirmed as contributing to disease risk, may indicate novel biological pathways in the genetic etiology for oral clefts.


Introduction
Nonsyndromic oral clefts, including cleft lip with or without cleft palate (CL/P) and cleft palate (CP) alone, are the most common craniofacial malformations in humans. The etiology of oral clefts is complex and heterogeneous with different environmental and genetic factors contributing to risk. Previous linkage and genome-wide association studies (GWAS) have identified multiple genes and regions associated with risk for CL/P. However, it is estimated that these regions only account for 20-25% of the heritability. Improvements in sequencing technology allows us to expand our search for causal variants even further (Beaty et al. 2016). Recently, we have identified a novel, potentially damaging variant in CDH1 in one multiplex CL/P family based on whole exome sequence (WES) data (Bureau et al. 2014). For this analysis, we used WES and whole genome sequence (WGS) data in families with distantly related affected individuals (second or third degree relationships) to identify genes containing shared rare variants.
The goal of this study was to identify novel rare variants shared at the population or the family level that may contribute to risk for oral cleft phenotypes. Our most notable finding came from a single Syrian family where we identified enrichment of nonsynonymous and potentially damaging rare variants in two genes: CASP9 and FAT4. Furthermore, the CASP9 variant was not present in any of the other affected individuals in this study, including the other Syrian families, nor is it reported in any of the major variant databases. Here we provide a summary of our results with a focus on the most interesting findings to assess the possible biological roles of these rare variants in influencing risk for oral clefts.

Ethical compliance
All studies were approved by the local Institutional Review/Ethics Boards and followed the tenets of the Declaration of Helsinki.

Data collection
The multiplex cleft families studied here were originally ascertained and recruited by different studies for linkage analysis. Families were enrolled because they had at least two biological relatives affected with an apparent nonsyndromic oral cleft. Some families have been previously genotyped and included in published linkage analyses (Wyszynski et al. 2003;Field et al. 2004;Marazita et al. 2004Marazita et al. , 2009Schultz et al. 2004;Riley et al. 2007;Mangold et al. 2009), but the specific marker panels varied and provided sparse coverage of the genome. Families were enrolled in studies in Germany, India, the Philippines, and the Syrian Arab Republic. Each study was conducted somewhat differently, but, in general, a patient with nonsyndromic oral cleft was identified, a preliminary family history investigation revealed at least one additional affected relative existed, and the family was evaluated for potential informativeness for linkage studies. Multiplex families identified as informative were enrolled, and both affected and unaffected relatives were consented and recruited. Study participants were examined to confirm their phenotypic status, a DNA sample was collected, and, for some individuals, limited information concerning potential environmental risk factors (such as mother's smoking history during pregnancy) was available. All three types of oral clefts were identified in these families: cleft lip and palate, cleft palate only, and cleft lip only. We also obtained information on the location of the cleft (right, left, bilateral, or midline), and whether it was a complete or incomplete cleft. The specific oral cleft phenotype varied within and between families. Families were selected for this study if DNA samples were available for at least two second or third degree affected relatives who had given informed consent adequate for DNA sequence analysis. The second degree relatives included half-sibs, avuncular, or grandparental pairs, while the third degree relatives included first cousins and great-avuncular pairs). Some more distant affected relatives such as second cousins and first cousins once removed were also included. Due to funding constraints, only affected family members were sequenced in almost all families and parents of affected individuals were not sequenced.
For the WES portion of the study, we sequenced 108 affected individuals from 52 families (four of the families each had three affected individuals sequenced, and the remaining 48 families each had two affected individuals sequenced, four duplicate subjects for quality control, and two unrelated controls from the CEU HAPMAP population (Utah residents with Northern and Western European ancestry from the Centre d'Etude de Polymorphisme Humain (CEPH) data collection)). These multiplex families were from Syrian, Filipino, Indian, and German populations ( unaffected, five families with five affected individuals, two families with four affected individuals and two unaffected, and one family with eight affected individuals). All of these families were from either Syrian or Filipino populations, and all of the unaffected individuals were Filipino. Table 1 shows the country of origin of these families along with individual and family counts noting the country of origin.

Sequence data generation
Whole exome sequencing Exome sequencing and genotyping was done at the CIDR. DNA sequencing was performed on an Illumina â HiSeq 2500 instrument using standard protocols for a 100-bp paired-end run. Six samples were run per flowcell, guaranteeing >90-95% completeness at a minimum of 209 coverage.
Illumina HiSeq reads were processed through Illumina's Real-Time Analysis (RTA) software generating base calls and corresponding quality scores. Resulting data were aligned to a reference genome with the Burrows-Wheeler Alignment (Li and Durbin 2010) (BWA) tool creating a SAM/BAM file. Postprocessing of the aligned data includes local realignment around indels, base call quality score recalibration performed by the Genome Analysis Tool Kit (GATK) (McKenna et al. 2010;DePristo et al. 2011;Van der Auwera et al. 2013), and flagging of molecular/optical duplicates using software from the Picard program suite. Multisample variant calling was performed using GATK 2.0's Unified Genotyper. Variant quality score recalibration (VQSR) was done in GATK 2.0. CIDR required a minimum mean of 89 coverage before calling any single-nucleotide variant (SNV), but the overall coverage averaged 849 across all exons. Further details of this process are provided in the methods section of Bureau et al. (2014).

Whole genome sequencing
WGS on genomic DNA samples was performed by Illumina, Inc. (San Diego, CA, USA) using TruSeq SBS v3 Reagents, HiSeq Control Software (HCS) and RTA on a HiSeq 2000 machine for real-time image analysis and base calling. Genome assembly, genotype calling, and QC filtering was performed using tools in the CASAVA package. Multisample VCF files were generated using VCFtools (Danecek et al. 2011) and were backfilled with custom scripts to include homozygous reference genotypes and depth of coverage. Full details of the sequencing, alignment, and variant calling process are provided in the supporting information of Mathias et al. (2016).

WES data
To find potentially causal variants for oral cleft phenotypes in the WES data, we performed an initial filtering step to remove variants of low quality based on specific metrics (described below), and variants in genes with extremely high variation (Table S1) (Schmidt et al. 2013). Called variants were dropped if they failed the following quality metrics: mapping quality <30, depth <8 or depth >20,000, non-Y SNP call rate <98%, replicate errors occurring in >1 pairs (among six duplicate subjects), monomorphic, and failing both the GATK VQSR filter and an in-house machine learning metric that combines many of these QC measures to estimate the probability of being a low-quality variant. The variant was dropped if this probability of being low quality exceeded 0.70. We then performed variant-based counting steps at the population level and family level to identify potential risk variants for oral clefts. Here we define population as being one of the four ethnic groups (Syrian, German, Indian, and Filipino). For both family and population level analyses, we further filtered out common variants (minor allele frequency >5% in 1000 Genomes Phase I data, or in the dbSNP common variant set) and variants with an alternate allele present in either of the two controls ( Fig. 1).
For the population-specific analyses, we further screened the rare variants identifying those that were homozygous for the alternate allele in at least one case and at least 20% of all cases had the alternate allele in either the homozygous or heterozygous state. This criterion was chosen to allow for some within-population heterogeneity, while enriching for potential recessive candidate variants by requiring at least one individual be homozygous for the rare allele. Since many of the families included in this study exhibit at least some degree of consanguinity, this was deemed a reasonable criterion. For the family-specific analysis, we identified variants that were homozygous for the alternate allele in at least two affected members of any given family. This is a relatively stringent criterion since these families had sequence data available on only two or three affected individuals. We purposely chose more stringent criteria for the family-specific analysis because it is more likely that the risk variant is the same within the consanguineous families that made up a large proportion of the studied families as opposed to within a population. We selected homozygous variants to enrich for potentially causal recessive candidate variants since even distantly related individuals within the same family share a large number of variants by chance. Furthermore, there is a higher level of observed variant sharing than would be expected (most likely due to extensive consanguinity) in some of these families. Two families contained only one affected individual due to sequencing failure of one member and were dropped from the family-specific analysis.
We also incorporated annotation information from wAnnovar (Wang et al. 2010;Chang and Wang 2012;Yang and Wang 2015) to assess our set of interesting genes based on potential pathogenicity, along with gene location (exonic or intronic), and predicted variant function (nonsynonymous or synonymous). Nine sources were used from wAnnovar to assess the potential pathogenicity of each variant: SIFT, Polyphen2 HDIV,

WGS data
In the WGS data, we performed both validation and discovery analyses (Fig. 1). There was very little overlap between individuals in the WES and WGS datasets (only six individuals from two families; Table 1). For our population-specific validation analyses, we assessed genotype counts in the WGS data for variants and genes identified in the WES analyses. Validation in the population-specific analyses was limited to the Syrian and Filipino families. Validation in the family-specific analyses was limited to the two Syrian families that had both WES and WGS data.
As a follow-up to the results from the single variant validation analysis for Syrian Family 1, we further analyzed the WES data to identify genes with potential compound heterozygous individuals (i.e., having two or more rare variants in the same gene). We specifically identified genes where all three affected individuals had more than one exonic, nonsynonymous variant in the WES data. We validated these results in the WGS data by assessing genotypes for all exonic, nonsynonymous variants in the genes identified in the WES analysis. We define validated genes as those with at least one WGS member who was a potential compound heterozygote.
Discovery in the WGS data was limited to the largest Syrian family (Family 1) with sequence data available on eight affected individuals. There are a total of 22 affected individuals in this large, highly consanguineous family. Almost all parents of affected individuals are consanguineous, often related through multiple paths. The eight individuals were chosen for sequencing such that they formed the most distantly related relative pairs to try to limit chance identity by descent allele sharing. After performing the same genotype quality and frequency filtering steps completed for the WES data, we identified variants with at least one family member who was homozygous for the alternate allele and at least seven of the eight family members carrying the alternate allele (this allowed for some within-family heterogeneity, as the results from the WES analysis suggested this might be present). To reduce the number of potential risk variants to those that were more likely to be functional, we only considered exonic, nonsynonymous variants. We also performed a follow-up analysis in the WGS data to assess rare variants in known enhancer and promoter regions of identified candidate genes.

WES single variant discovery analysis
After performing the filtering steps shown in Figure 1, we used wAnnovar to assess the potential function of the most interesting variants. We also included allele frequency information from a new databases created by the Greater Middle East (GME) Variome project (Scott et al. 2016) and the Qatar Genome (Fakhro et al. 2016) for a better assessment of variants found in the Syrian population. Table 2 shows the variants identified in the family-based analyses, along with annotation information. Tables S2-S5 show the results for the population-specific analyses. We excluded all variants in the HLA region, as they have been shown to be highly population-specific (Sanchez-Mazas and Meyer 2014).

WGS single variant validation analysis
We performed several validation analyses for our most interesting findings from the WES analysis in the WGS data. For the population-specific analyses, we were able to perform validation in the Syrian and Filipino populations. All of the population-specific discovery and validation analysis results are shown in Tables S2-S5. In the Syrian population, all of the WGS individuals were homozygous for the reference allele for the two variants identified in the CTSL3P gene. The variant identified in the SYT17 gene was not present in any other subjects (Table S2). For the individuals from the Filipino population, neither of the two variants identified in the WES analysis were present in the WGS data, most likely due to low quality (Table S4).
For the family-specific variants, we were able to perform validation in two of the Syrian families (1 and 3). As previously stated, there was little overlap between the WES and WGS individuals (Table 1), and we removed any overlapping individuals from this validation analysis.
For the eight affected individuals with WGS data in Syrian Family 1, we confirmed the homozygous genotypes observed in the three WES individuals for the nonsynonymous CASP9 variant shown in Table 2. For the five additional affected individuals in this family, one was homozygous for the alternate allele, three were heterozygous, and one was homozygous for the reference allele (Table 3).
For Syrian Family 3, there were five individuals with WGS data, three of which were also in the WES analysis. We could not confirm the genotype for the novel variant in HCN2 for two of the individuals from the WES analysis, because they were not included in the whole genome sequencing project. For the one WES individual who also had WGS data, the homozygous genotype for the alternate allele was confirmed. For the two individuals who were not in the WES analysis, one was heterozygous and one was homozygous for the reference allele. Thus, strong validation for this variant identified in Syrian Family 3 was not achieved due to the low number of individuals with WGS data.

Compound heterozygous analysis (WES discovery and WGS validation)
To determine if individuals in Syrian Family 1 had any other variants potentially contributing to risk for oral clefts, we searched for genes with more than one heterozygous variant in the WES data. Such an observation may indicate that the affected individuals have two distinct deleterious alleles at the same locus. Because we do not have phase information on these   For each variant we give the gene name, chromosome (Chr.), base pair (BP), alternate allele (A), reference allele (R), number of individuals homozygous for the alternate allele (AA), number of heterozygous individuals (AR), number of individuals homozygous for the reference allele (RR), the gene location (Location), the function of the variant (NS, nonsynonymous; S, synonymous), the frequency of the alternate allele for all populations in 1000 Genomes (1000G Freq.), the frequency from the Greater Middle East Variome Project (GME Freq.), the frequency from the Qatar Genome data (QG Freq.), and the number of sources that predict the base pair change to be damaging out of the nine present in wAnnovar. The dashes (-) represent variants that were not present in the specific frequency database.
individuals, we cannot rule out the possibility that the alternate alleles were inherited from the same parent as a rare haplotype. We still deem this as interesting as it could identify compound heterozygotes or the rare haplotype could itself confer increased risk for oral clefts. We defined one gene as possibly having compound heterozygotes when all three WES individuals had more than one exonic, nonsynonymous variant in that particular gene. While there were no other exonic and/or predicted deleterious rare variants identified in CASP9, we did identify four genes meeting our definition of compound heterozygosity (two variants in COL7A1, two variants in CELSR3, two variants in TKT, and three variants in NLRP14) (Table 4). We then performed a validation analysis of these results using the WGS data by assessing the genotypes for all exonic, nonsynonymous variants present in these four genes. No other exonic, nonsynonymous variants were found in these genes in any of the eight family members with both WES and WGS data. We were able to confirm the heterozygous genotypes for individuals with WES data in the WGS data. Two of the five additional WGS individuals were compound heterozygous for two identified genes: COL7A1 and TKT (Table 4).

WGS discovery and follow-up analyses
We performed a discovery analysis for the eight individuals with WGS data in Syrian Family 1 to search for potentially damaging variants that may have been missed in the WES analysis. First, we performed the same quality and allele frequency filtering steps as in the WES discovery analysis. We then performed genotype filtering requiring at least one family member to be homozygous for the alternate allele, and for the alternate allele to be present in at least seven of the eight other family members. We did not require that all eight of the individuals carry the alternate allele given the complex and distant relatedness patterns in this family. We further filtered variants based on annotation from wAnnovar. Specifically, we selected only rare, exonic, nonsynonymous variants. Using these steps, we identified one gene, FAT4, with three exonic, nonsynonymous variants, one of which was predicted to be damaging by three different sources in wAnnovar (Table 3). Genotypes for the eight total variants passing our filtering steps in this WGS validation and discovery results (all from Syrian Family 1) are listed in Table 5.
We also performed a follow-up analyses to determine if there were any rare variants in eight known enhancer regions and two promoter regions of CASP9 (Table S6). Enhancers were selected from the GeneHancer database based on their gene enhancer score (>5) (Fishilevich et al. 2017). We assessed promoters that were <200 kb from the transcription start site of CASP9 (Stelzer et al. 2016). Table S7 shows the results from this analysis. We did identify several heterozygous variants in these potential CASP9 regulatory regions. Interestingly, the individual with the most heterozygous variants in these regions was homozygous for the reference allele. Furthermore, no regulatory region variants were identified in the affected individuals that were homozygous for the alternate allele of the nonsynonymous, exonic CASP9 variant. This may indicate within-family allelic heterogeneity.

Discussion
In this study, our goal was to identify rare potential risk variants shared by distantly related individuals with oral clefts. To do this, we performed genotype and annotation filtering steps to find genes with variants that are not common in population databases such as 1000 Genomes, but for which affected individuals were enriched at both the population and family levels. We had the most power to do this in a single Syrian family, because it had the highest number of affected individuals with sequence data and the greatest overlap between WES and WGS data.
Our results reflected this, as we were able to detect eight exonic, nonsynonymous variants (four passing the single variant analysis filter and four passing the compound heterozygous filter) that show extensive sharing among these affected relatives from this highly consanguineous family. Thus, one or more of these variants may contribute to risk for oral clefts in this family. Our most intriguing finding was a nonsynonymous variant in CASP9 that occurred in seven of the eight family members (four of whom were homozygous for the alternate allele). We identified this as our most interesting finding for several reasons. First, this allele is not present in any of the other Syrian families, 1000 Genomes, the Qatar Genome Database, or the GME Variome project database. Second, it is a nonsynonymous, exonic variant predicted to be pathogenic by two different sources in wAnnovar. Finally, there is strong evidence that apoptotic genes play a role in the etiology of oral clefts. Among many other aspects of embryonic development, apoptosis plays a crucial role in craniofacial development. Failure of apoptosis during development may result in oral clefts (Smane et al. 2013). While no other studies to date have identified variants in CASP9, it is directly involved in an apoptotic signaling pathway shown to result in a facial cleft phenotype in mouse models (D'Amelio et al. 2010).
Three of the other genes identified in this study contain variants known to cause severe Mendelian syndromes (FAT4 in Van Maldergem syndrome [Alders et al. 2014, p. 4], COL7A1 in recessive dystrophic epidermolysis bullosa [Hovnanian et al. 1997], and TKT in a syndrome which includes short stature, developmental delay, and congenital heart disease [Boyle et al. 2016]). While none of these syndromes include oral clefts as a key phenotype, variants in FAT4 leading to Van Maldergem syndrome can present with craniofacial abnormalities (Alders et al. 2014, p. 4). Interestingly, the one individual from Syrian Family 1 who was homozygous for the reference allele for the variant in CASP9 was homozygous for the alternate allele at all three of the FAT4 variants and, further, had more than one variant predicted to be pathogenic in both COL7A1 and TKT. This same individual also had the highest number of rare variants (four) in enhancer and promoter regions of CASP9. It is also important to note that this individual had a phenotype (incomplete cleft palate) that was distinct from the other seven family members (all others had cleft lip with or without cleft palate) (Table 5). Together, this may represent withinfamily heterogeneity for genetic factors contributing to risk to nonsyndromic oral clefts.
One notable limitation of our study is there were no unaffected family members with WES data nor were there any sequenced unaffected individuals in the Syrian WGS data. This was a major limitation for our populationbased analyses, as many of these populations are not well represented in available allele frequency databases (e.g., 1000 Genomes and ExAC). Therefore, any of our interesting findings at the population-level may be populationspecific and not truly phenotype specific. While this is also a limitation in our family-based analyses, we have partly addressed this by limiting our results to those with a higher likelihood of being functional based on multiple Table 5. Genotypes for the eight individuals with WGS data in Syrian Family 1 for the WES and WGS validation and discovery results.

Single variant analyses
Compound Het. analyses annotation information (exonic, nonsynonymous, with some evidence of being pathogenic).
In this analysis we have identified rare and potentially damaging variants shared by affected family members in a single Syrian family. While these candidate genes and variants need to be assessed further in the unaffected and other affected members of this family, none have been previously identified as risk factors for nonsyndromic oral clefts and may indicate novel genetic underpinnings for this phenotype.

Supporting Information
Additional Supporting Information may be found online in the supporting information tab for this article: Table S1. Frequent hitter gene names. Table S2. Genes with variants that passed population-specific analysis filter in WES data in the Syrian cohort with WGS validation. Table S3. Genes with variants that passed population-specific analysis filter in WES data in the Indian cohort. Table S4. Genes with variants that passed population-specific analysis filter in WES data in the Filipino cohort with WGS validation. Table S5. Genes with variants that passed population-specific analysis filter in WES data in the German cohort. Table S6. The IDs and genomic regions for the eight enhancer regions and two promoter regions for CASP9. Table S7. Genotypes for the eight individuals with WGS data in Syrian Family 1 for the variants in enhancer and promoter regions of CASP9.