Evaluating rare coding variants as contributing causes to non-syndromic cleft lip and palate
The authors have no conflicts to disclose.
Rare coding variants are a current focus in studies of complex disease. Previously, at least 68 rare coding variants were reported from candidate gene sequencing studies in non-syndromic cleft lip and palate (NSCL/P), a common birth defect. Advances in sequencing technology have now resulted in thousands of sequenced exomes, providing a large resource for comparative genetic studies. We collated rare coding variants reported to contribute to NSCL/P and compared them to variants identified from control exome databases to determine if some might be rare but benign variants. Seventy-one percent of the variants described as etiologic for NSCL/P were not present in the exome data, suggesting that many likely contribute to disease. Our results strongly support a role for rare variants previously reported in the majority of NSCL/P candidate genes but diminish support for variants in others. However, because clefting is a complex trait, it is not possible to be definitive about the contribution of any particular variant to the risk of NSCL/P.
The genome-wide association study (GWAS) is a popular tool for identifying common alleles influencing susceptibility to complex diseases and traits. While hundreds of GWAS have been performed, the loci they have identified collectively account for only a small percentage of disease risk. As a result, there has been an increasing emphasis on the role of rare variants in response to the ‘missing heritability’ often seen with the GWAS approach. Prior to GWAS, candidate gene studies often included sequencing of coding regions to identify mutations which, when absent in a few hundred controls, were suggested to be etiologic. Thousands of exomes have now been sequenced across a range of phenotypes, representing a catalog of variation that can be used as controls for disease studies [2, 3].
Non-syndromic cleft lip with or without cleft palate (NSCL/P) affects 1/1000 individuals worldwide. Linkage, candidate gene, and genome-wide association studies have been performed in search of genetic risk factors for NSCL/P. Here we review the mutations reported in 25 candidate genes and compare them to variants found in 6500 exomes publicly available through the 1000 Genomes Project (1kGP) and the NHLBI Exome Sequencing Project (ESP6500) to provide evidence supporting or diminishing an etiologic role for these variants in NSCL/P. This effort parallels earlier comparisons of loci identified from candidate gene association studies, many of which have failed to replicate in subsequent GWAS. Reasons for these failures to replicate include poorly matched controls, false positive results, failure to account for multiple comparison testing, and the ‘winner's curse’, where the results are positive but the overall genetic effect is overestimated, causing replication attempts to use inadequate sample sizes. Some of these same phenomena could affect candidate gene sequencing studies, and we sought to determine whether addressing one source of false positives, inadequate numbers of controls examined, might distinguish rare normal variants from those of etiologic significance.
Materials and methods
We broadly queried the PubMed database with terms related to NSCL/P including ‘cleft lip and palate’ and ‘non-syndromic cleft’. From papers (written in English) reporting candidate gene sequencing, we manually curated a list of all missense, nonsense, and splicing variants described as ‘mutations’ or ‘rare variants’ contributing to the etiology of NSCL/P. We identified 67 variants reported in 25 genes (Supporting information, Table S1). Table S2 contains the mutation positions for genes in which the RefSeq transcripts have been updated since the original publications. We compared these variants with those obtained from the 1000 Genomes Project (1091 individuals, March 2012 data release) and the NHLBI Exome Sequencing Project (6503 individuals drawn from a range of studies investigating cardiovascular disease). For the purposes of this paper, we will use the term ‘control’ to describe these variants. Variants from the 1000 Genomes Project were annotated using the SeattleSeq SNP annotation software (Build 134, http://snp.gs.washington.edu/SeattleSeqAnnotation134/). We used Polyphen2 to predict damaging effects of missense mutations. Truncation and splice mutations were considered probably damaging.
Previous resequencing of NSCL/P candidate genes identified 43 missense, nonsense, or splicing variants in 19 genes that were not present in the controls sequenced in those studies (Table 1). An additional 24 variants in 12 genes were reported in NSCL/P cases and matched population controls or samples of the CEPH Human Diversity Panel (Table S3). While some of these variants have been associated previously with NSCL/P, such as P153Q in MSX1 and W185X in PVRL1, we chose to focus on the 43 variants unique to NSCL/P cases in this study. Of the 43 missense, nonsense, or splicing variants, 12 (30%) were present in the 1kGP or ESP6500 datasets, suggesting that they may be rare but benign variants. However, the remaining 31 variants were not found in over 12,000 control chromosomes and may be etiologic variants contributing to NSCL/P.
Table 1. Rare missense, nonsense, and splicing variants reported in NSCL/P
|BMP4||A346V||Probably damaging||USA (WA)||MCL + BU||1||0||0|||
|FGF8||D73H||Possibly damaging||USA (IA)||CLP, CP||1||0||0|||
|FGFR2||R84S||Probably damaging||USA (IA)||CLP||1||0||0|||
|FOXE1||D285V||Possibly damaging||USA (IA)||CLP||1||0||0|||
|GLI2||R374H||Probably damaging||USA (IA)||CLP||1||1||0|||
|MSX2||S63C||Possibly damaging||USA (IA)||CLP||1||0||0|||
|NUDT6||K172N||Probably damaging||USA (IA)||CLP||1||0||0|||
|PVRL1||V89M||Probably damaging||N. America||CL/P||2||0||0|||
|PVRL1||IVS4 + 1||Splicing||Australia||CL/P||1|| || |||
|PVRL2||E41K||Possibly damaging||USA (IA)||CL/P||1||2||0|||
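The comparison behind Table 1 amounts to splitting case-only variants by whether they were also seen in the control exomes. The following is a minimal sketch of that step, not the authors' pipeline. It uses only the rows shown above and assumes, because the column headers are not reproduced here, that the first count in each row is the number of cases and the last two counts are observations in the two control datasets. Blank cells are kept as `None` and reported separately rather than being treated as zero.

```python
# Rows transcribed from the Table 1 excerpt: (gene, variant, cases, control_a, control_b).
# Assumption: control_a and control_b are counts in the two control datasets;
# the column order is not stated in the excerpt. None marks a blank table cell.
TABLE1_ROWS = [
    ("BMP4", "A346V", 1, 0, 0),
    ("FGF8", "D73H", 1, 0, 0),
    ("FGFR2", "R84S", 1, 0, 0),
    ("FOXE1", "D285V", 1, 0, 0),
    ("GLI2", "R374H", 1, 1, 0),
    ("MSX2", "S63C", 1, 0, 0),
    ("NUDT6", "K172N", 1, 0, 0),
    ("PVRL1", "V89M", 2, 0, 0),
    ("PVRL1", "IVS4 + 1", 1, None, None),
    ("PVRL2", "E41K", 1, 2, 0),
]


def partition_by_control_presence(rows):
    """Split variants into absent-from-controls, seen-in-controls, and unassessed."""
    absent, present, unassessed = [], [], []
    for gene, variant, _cases, ctrl_a, ctrl_b in rows:
        label = f"{gene} {variant}"
        if ctrl_a is None or ctrl_b is None:
            unassessed.append(label)  # no control count available in the table
        elif ctrl_a + ctrl_b > 0:
            present.append(label)     # observed at least once in a control dataset
        else:
            absent.append(label)      # never observed in either control dataset
    return absent, present, unassessed


absent, present, unassessed = partition_by_control_presence(TABLE1_ROWS)
print("Absent from controls:", absent)
print("Seen in controls:", present)
print("Not assessed:", unassessed)
```

Under these assumptions, only GLI2 R374H and PVRL2 E41K from the excerpt fall into the seen-in-controls group.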
We compared the Polyphen2 predictions between rare variants found exclusively in NSCL/P cases, variants found in both datasets, and variants found only in the 1kGP and ESP6500 datasets (minor allele frequency < 1%) (Table 2). There was a higher frequency of damaging variants in the NSCL/P group (67.7%) than in the 1kGP/ESP6500 group (54.5%); however, this difference was not statistically significant (p = 0.14, Chi-square). Because not all of the sequencing variants identified in the NSCL/P resequencing studies have been reported, we could not directly compare the overall frequency of rare variants in cases to the frequency of variants in controls. These data show that Polyphen2 predictions are not sufficient to distinguish between these potentially pathogenic variants and the presumably benign variants of 1kGP and ESP6500.
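The Chi-square comparison can be approximated from the reported figures. The sketch below assumes the counts implied by the percentages and the group totals in Table 2: 21 of 31 damaging variants in the NSCL/P-only group and 437 of 802 in the 1kGP/ESP6500-only group. These counts are reconstructed, not taken from the table itself, and the test shown is an uncorrected Pearson Chi-square; the text does not say whether a continuity correction was applied.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Assumed counts reconstructed from 67.7% of 31 and 54.5% of 802.
nsclp_damaging, nsclp_total = 21, 31
control_damaging, control_total = 437, 802

chi2, p_value = chi_square_2x2(
    nsclp_damaging,
    nsclp_total - nsclp_damaging,
    control_damaging,
    control_total - control_damaging,
)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}")
```

With these reconstructed counts the p-value comes out at roughly 0.15, in line with the non-significant result reported above.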
Table 2. Polyphen2 predictions of rare variants (MAF<1%) in NSCL/P candidate genes
|Total||31|| ||12|| ||802|| |
Prior to the GWAS era, candidate gene investigations drove many studies searching for the genetic etiology of NSCL/P. Genes sequenced in this approach included BMP4, FGF10, FGF8, FGFR1, FGFR2, FGFR3, FOXE1, GLI2, JAG2, LHX8, MSX1, MSX2, NUDT6, PAX9, PTCH1, PVR, PVRL1, PVRL2, RYK, SATB2, SKI, SPRY2, TBX10, TGFB3, and TP63 [4, 8]. IRF6, the causal gene in Van der Woude syndrome (VWS), has also been studied by sequencing for its role in NSCL/P, but given the phenotypic overlap between VWS and NSCL/P we have not included it in this analysis. Overall, the absence of 70% of variants from current exome databases with over 7500 available controls provides support for a role of rare variants in NSCL/P in many of the published candidate gene studies. These results strongly support a role for rare variants in MSX1 and members of the FGF signaling pathway. In addition, neither of the two de novo mutations (D73H in FGF8 and R352G in TP63) [10, 11] was present in the exome databases. Both mutations occur at highly conserved residues and were not present in 1000 population-matched controls. With the addition of another 7000 controls, there is a statistically significant association of these mutations with NSCL/P (p = 0.03). However, assuming that each gene has one mutation (after sequencing 200 cases), approximately 20,000 control exomes would be needed for the variants in these candidate genes to be significant as a group.
Although most rare variants are functionally deleterious, the variants found in complex diseases are less penetrant than their Mendelian counterparts. The majority of the rare variants reported in individuals with NSCL/P are inherited from unaffected parents and/or are found in unaffected siblings. Because of this, it is thought that the development of NSCL/P is ultimately the result of the combined action of additional genetic, environmental, or stochastic factors. Thus the missense mutations found in controls as very rare, but possibly benign variants, may be relevant risk factors that require additional factors to manifest an effect. Testing these variants in model systems may provide supportive data for their contributory role.
Comparison of mutations causing Mendelian disorders with public databases is intuitive because Mendelian disorders are typically rare, making it less likely that these disorders are present in the individuals sequenced in the 1kGP or ESP6500. This assumption is weaker for comparisons with less rare disorders such as NSCL/P. There is no phenotype data attached to the sequencing information from 1kGP or ESP6500; however, with a frequency of 1/1000, the presence of cases of NSCL/P is likely to be minimal. One exception is for the mutations in BMP4 associated with subclinical defects of the orbicularis oris muscle, which are present in 3%–11% of controls.
Although the 1kGP has dedicated tremendous effort to sequencing diverse, worldwide populations, many of the populations commonly studied for NSCL/P are not well represented. The prevalence of clefting varies by ancestry, most commonly affecting those of Asian or Amerindian descent (1/500) and less commonly those of African descent (1/2500). Many of the reported variants were found in individuals from the Philippines, Chile, Uruguay, Mongolia, and Thailand, none of which are currently represented in the 1kGP. While we cannot exclude the possibility that some of these variants are rare, population-specific polymorphisms, it is encouraging that many of the mutations identified in these populations were not present in matched controls or in the populations available from the 1kGP or ESP6500. Even with 7500 exomes, we may still be missing rare variants of very low frequency. Recent modeling work [14, 15] suggests that the impact of rare variants on heritability may be greater than previously thought but that the frequencies, as a result of recent human population growth, may be below the thresholds detectable even by the 7500 controls used here.
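The detection limit of a control panel can be illustrated with a simple binomial calculation. The sketch below is an illustration only and is not part of the original analysis. It assumes that each of the 2n chromosomes in n control exomes carries a given allele independently with probability equal to its population frequency, and that this frequency is the same in the control populations as in the population where the variant was reported. The function name and the example frequencies are chosen here for illustration.

```python
def prob_observed_at_least_once(allele_freq, n_individuals):
    """Probability that an allele appears at least once among 2 * n_individuals chromosomes."""
    n_chromosomes = 2 * n_individuals  # diploid, autosomal locus assumed
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes


# 7500 is the approximate number of control exomes cited in the text.
N_CONTROLS = 7500
for freq in (1e-3, 1e-4, 1e-5):
    p = prob_observed_at_least_once(freq, N_CONTROLS)
    print(f"allele frequency {freq:g}: P(seen in {N_CONTROLS} exomes) = {p:.3f}")
```

Under these assumptions, an allele at a frequency of 1 in 100,000 would be seen in 7500 exomes only about 14% of the time, so its absence from the controls says little about whether it is benign.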
Although many of the mutations reported in GLI2, JAG2, and SPRY2 were also found in the exome data, diminishing a role for these rare variants in NSCL/P, these genes remain candidates through supporting biological or statistical data. Ultimately, large control data sets will provide supportive data on whether specific mutations contribute to common disease or have little or no impact. As whole exome and whole genome sequencing continue to add to control data, it will be necessary to reevaluate the role that discovered mutations play in NSCL/P and other common disorders.
We would like to thank Chris Lopez for assistance with the literature review and thank the many laboratories that have contributed to the studies reported here and the many families who have participated in research protocols to make these studies possible. Our own candidate gene work has been especially enabled by collaborations with Kaare Christensen and Mary Marazita. The authors would like to thank the NHLBI GO Exome Sequencing Project and its ongoing studies which produced and provided variant calls for comparison: the Lung GO Sequencing Project (HL-102923), the WHI Sequencing Project (HL-102924), the Broad GO Sequencing Project (HL-102925), the Seattle GO Sequencing Project (HL-102926), and the Heart GO Sequencing Project (HL-103010). This work was supported by NIH grants R37-DE008559 and U01-DE020057. EJL was supported by NIH Training Grant GM008629.