Large-scale objective association of mouse phenotypes with human symptoms through structural variation identified in patients with developmental disorders


  • Hannah Boulding,

    1. MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
  • Caleb Webber

    Corresponding author
    1. MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
    • MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford OX1 3QX, UK.

  • For the Deep Phenotyping Special Issue


Copy number variants (CNVs) are thought to underlie many human developmental abnormalities. However, it is unclear how many of these CNVs exert their pathogenic effects or, in particular, how distinct CNVs at dispersed loci can give rise to the same abnormality. We hypothesize that the mouse orthologs of genes whose copy number change gives rise to the same human abnormality might also yield a similar phenotype when disrupted in mice. Thus, by bringing together a large number of disparate CNVs, we may be able to identify an unusually overrepresented phenotype among the affected genes' mouse orthologs. We obtained 1,624 de novo CNVs identified in patients with developmental abnormalities from the Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources and the European Cytogeneticists Association Register of Unbalanced Chromosome Aberrations databases. Forming CNV sets for each of 1,088 distinct human abnormalities, we were able to associate a total of 143 (13%) human abnormalities with mouse model phenotypes. Although many mouse phenotypes are readily comparable to their associated human abnormality, others are less so, generating novel biological hypotheses. Of the 2,086 candidate genes that contribute to these associations, 65% have not been previously associated with human disease in Online Mendelian Inheritance in Man, and their distribution suggests both extensive pleiotropy and epistasis while also proposing a small number of simple additive consequences. Hum Mutat 33:874–883, 2012. © 2012 Wiley Periodicals, Inc.


Developmental disorders occur in approximately 3% of births. For example, mental retardation (intellectual disability) has a prevalence of 1–3% [Chelly et al., 2006]; autism, 1% [Kogan et al., 2009]; facial clefts, 0.12% [Bister et al., 2011]; and congenital heart disease, 0.6% [Hoffman and Kaplan, 2002]. Often these patients are found to harbor copy number variants (CNVs; duplicated or deleted regions of the genome larger than 1 kb) that are suspected of underlying the patient's symptoms [Lu et al., 2007; Stankiewicz and Beaudet, 2007; van Karnebeek et al., 2005]. Several developmental disorders are known to be the result of recurrent copy number variation at particular loci [van Binsbergen, 2011]. However, many patients presenting with similar symptoms possess non-overlapping CNVs, making it difficult to definitively associate the copy number change of any particular gene with an individual's disorder.

To address this problem, databases such as Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources (DECIPHER) [Firth et al., 2009] and European Cytogeneticists Association Register of Unbalanced Chromosome Aberrations (ECARUCA) [Feenstra et al., 2006] have been established, which seek to collate cytogenetic and clinical data across a large number of patients. Collecting large numbers of CNVs from symptomatically similar patients enables not only the robust association of a recurrent copy-number-variable region, but also the application of functional enrichment analysis (FEA) approaches to dispersed CNVs [Shaikh et al., 2011; Webber, 2011; Webber et al., 2009]. FEA approaches hypothesize that non-overlapping CNVs observed in patients with a shared disorder may be affecting genes that participate in a common biological process, and it is the disruption of the same process within each of these patients that underlies their common symptom [Shaikh et al., 2011; Webber et al., 2009]. Thus, FEA can ask whether there is anything unusually common about the functions of genes overlapped by dispersed CNVs identified in the genomes of patients that present shared symptoms. In contrast to approaches that identify causal genetics through variation at a single locus, FEA approaches gain considerable power to identify a commonly affected process by simultaneously examining the contributions of many disparate variants across many individuals' genomes. When investigating the role of structural variation in diseases, FEA tests the null hypothesis that CNVs identified in patients with a common set of symptoms are randomly sampling the genome. The robust rejection of the null associates the detected functional bias with these patients' variants while singling out those genes that contribute to this association and the CNVs that harbor them as candidate causal elements for further investigation in the context of these symptoms.

How function is defined in FEA approaches is the key to their success. Our approach to identifying biological processes disrupted in human developmental disorders differs from others in that we exploit expertly annotated phenotypes arising from the targeted disruption of genes within the mouse [Shaikh et al., 2011; Webber, 2011; Webber et al., 2009]. The deliberate genetic manipulation of mice has contributed enormously to our understanding of gene function, while mouse models created to genetically mimic human disorders frequently provide mechanistic insights [Austin et al., 2004; Bult et al., 2008; Delorey et al., 2011; Silverman et al., 2010]. Here, we posit that the overrepresentation of particular phenotypes observed among mouse models for a set of human orthologs reveals biological processes that are shared among those orthologs. Thus, an unexpectedly frequent model phenotype among the mouse orthologs of genes overlapped by a group of distinct CNVs identified in patients with a common symptom would posit three conclusions: firstly, the objective association of the mouse model phenotype with the human symptom; secondly, that those genes whose mouse orthologs contribute to the mouse model phenotype association are candidate genes whose mutation underlies the human symptom; and thirdly, that those CNVs harboring these genes are candidate causal agents whose copy number change underlies the disorder. Importantly, rather than subjectively imposing an anthropomorphic view of the manifestation of human disorders in a nonhuman species, our approach allows mouse models and phenotypes relevant to the human disorder to be identified objectively [Chadman et al., 2009; Nestler and Hyman, 2010].

To identify accurate associations between human symptoms and mouse model phenotypes, both human and mouse phenotypes must be defined in a consistent manner. Both the DECIPHER and ECARUCA human clinical cytogenetic databases describe their patients' clinical symptoms using the London Medical Database (LMD) terms, a three-layer ontology that organizes human symptoms into a hierarchical structure, with each more specific term listed beneath a single, more general parent term [Fryns and de Ravel, 2002]. This hierarchical structure enables the investigation of smaller groups of patients presenting very specific symptoms, as well as grouping patients into larger datasets to examine more general human disorders. For the mouse phenotypes, the Mouse Genome Informatics (MGI) resource describes expertly annotated and published phenotypes resulting from a mouse gene's disruption using terms defined within the mammalian phenotype ontology (MPO) [Smith and Eppig, 2009]. As with other literature-based gene annotation sources such as the Gene Ontology [Harris et al., 2004] or Kyoto Encyclopedia of Genes and Genomes [Kanehisa et al., 2008], information is not available for all genes, nor is it necessarily comprehensive for those genes that are covered. Nonetheless, the knockout information held by the MGI may be of particular relevance to human copy number change, while the reporting of mouse model phenotypes in terms of MPO abnormalities may be particularly relevant to disease [Eppig et al., 2007].

In this study, we associate over 100 mouse model phenotypes with human clinical symptoms by identifying a significant enrichment of genes whose mouse orthologs' disruption yields these phenotypes within the de novo CNVs of patients presenting these symptoms. In addition to exploring different strategies for identifying these associations, we find evidence suggesting extensive pleiotropy and epistasis, along with a number of patients that appear to exhibit the additive effects of multiple copy-number-changed genes.

Materials and Methods

CNV Datasets

We obtained two very large sets of de novo CNVs for analysis. The larger set, from the ECARUCA database [Feenstra et al., 2006], consisted of 988 CNVs observed in 958 patients (Table 1). The second set, consisting of 636 CNVs from 525 patients, was obtained from the DECIPHER database [Firth et al., 2009] (Table 1). In each database, patient and CNV data are entered by a clinical geneticist and molecular cytogeneticist, respectively. To ensure consistent human symptom annotation among the different centers, patient symptoms are described using terms from the LMD [Fryns and de Ravel, 2002]. Most patients in both sets are described by multiple LMD terms (median = 7).

Table 1. Summary Statistics of the DECIPHER and ECARUCA Datasets
CNV set | Number of CNVs | CNV size range (Mb) | Median CNV size (Mb) | Symptoms per patient (median) | Symptoms per patient (range) | Genes covered (median/CNV)
DECIPHER | 636 | 0.003–53.7 | 2.3 | 5 | 1–32 | 10,487 (18)
ECARUCA | 988 | 1.3–146 | 18.1 | 8 | 2–102 | 17,791 (112)
Note: CNV size range and genomic coverage of CNVs.

The LMD is a three-layer ontology consisting of 1,682 terms describing human phenotypic abnormalities (Supp. Fig. S1). At its highest level, the LMD consists of 34 overarching human symptom terms (e.g., neurology, skeleton, etc.). Each of these parent terms has multiple children (e.g., mental cognitive function, general abnormalities) and grandchildren terms (e.g., mental retardation) listed beneath it. Each of the most specific terms within the LMD is described beneath a single parent term, and there is no overlap between the branches of the ontology.

For each different symptom, as defined by the LMD and observed in our sample, we formed a group of de novo CNVs drawn from those patients annotated with that symptom, termed symptom–CNV sets (Fig. 1). We formed sets separately for both DECIPHER and ECARUCA, as well as for the combined set of DECIPHER and ECARUCA CNVs, termed combined symptom–CNV sets. We also considered that the direction of copy number change (deletion/duplication) could affect the underlying pathoetiology of a particular disorder, and thus each CNV set formed was further subdivided into “loss” (deletion) and “gain” (duplication) CNV sets associated with a particular LMD symptom.
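The grouping described above can be sketched in a few lines. This is a minimal illustration only: the patient record layout, the function name `build_symptom_cnv_sets`, and the toy identifiers are assumptions made for the example and do not come from the DECIPHER or ECARUCA data formats.

```python
from collections import defaultdict

def build_symptom_cnv_sets(patients):
    """Group CNVs nonexclusively by symptom, then by direction of change.

    `patients` is a hypothetical list of dicts with keys 'symptoms'
    (LMD terms) and 'cnvs' (tuples of (cnv_id, direction), where
    direction is 'loss' for a deletion or 'gain' for a duplication).
    """
    sets = defaultdict(lambda: {"all": set(), "loss": set(), "gain": set()})
    for patient in patients:
        for symptom in patient["symptoms"]:
            for cnv_id, direction in patient["cnvs"]:
                # A CNV joins the set of every symptom its carrier presents.
                sets[symptom]["all"].add(cnv_id)
                sets[symptom][direction].add(cnv_id)
    return dict(sets)

# Toy data (invented for illustration).
patients = [
    {"symptoms": ["mental retardation", "short philtrum"],
     "cnvs": [("cnv1", "loss")]},
    {"symptoms": ["mental retardation"],
     "cnvs": [("cnv2", "gain"), ("cnv3", "loss")]},
]
symptom_sets = build_symptom_cnv_sets(patients)
```

Because membership is nonexclusive, `cnv1` appears in both the "mental retardation" and the "short philtrum" sets in this toy example.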

Figure 1.

The formation of symptom–CNV sets. The CNVs were grouped nonexclusively into symptom–CNV sets. Stage 1: For each symptom–CNV set, genes overlapped by these CNVs were obtained from ENSEMBL (see Materials and Methods section). Stage 2: Genes observed to be copy number variable within apparently healthy individuals were removed.

Assigning Genes to CNVs

Genes were defined by ENSEMBL EnsMart 54 [Flicek et al., 2010]. Protein-coding genes were assigned to a particular CNV if that CNV overlapped at least one protein-coding exon from every known transcript of that gene, thereby ensuring that protein-coding sequence would be affected whichever transcript was expressed. This method of determining those genes affected by a CNV has been shown to reduce the effects from significant length biases associated with genes that show tissue-specific expression patterns [Webber, 2011]. As we are interested in penetrant deleterious changes in gene copy number, we removed from each of the symptom–CNV gene sets those genes that are observed to be copy number variable in the same direction within apparently healthy individuals. These apparently benign copy number variable genes were those identified in the Redon et al. [2006] and Nguyen et al. [2008] datasets. Within each symptom–CNV set, a gene is listed only once even if it is observed to be copy number variable in more than one patient with the same symptom.
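The assignment rule can be illustrated with a short sketch. The data structures here (tuples of chromosome, start, and end; a per-gene dictionary of transcripts; a set of benign gene and direction pairs) are assumptions chosen for the example, not the ENSEMBL schema or the authors' implementation.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals on the same chromosome overlap."""
    return a_start <= b_end and b_start <= a_end

def gene_hit_by_cnv(cnv, transcripts):
    """Apply the rule: the CNV must overlap at least one coding exon
    of EVERY known transcript of the gene.

    cnv: (chrom, start, end)
    transcripts: hypothetical dict transcript_id -> list of
                 (chrom, start, end) protein-coding exons.
    """
    chrom, start, end = cnv
    if not transcripts:
        return False  # no annotated transcripts, nothing to assign
    return all(
        any(ex_chrom == chrom and overlaps(start, end, ex_start, ex_end)
            for ex_chrom, ex_start, ex_end in exons)
        for exons in transcripts.values()
    )

def candidate_genes(cnvs, gene_models, benign):
    """Collect each affected gene once, skipping genes seen to vary in
    the same direction in apparently healthy individuals.

    cnvs: list of (chrom, start, end, direction)
    benign: set of (gene, direction) pairs (assumed layout).
    """
    hits = set()
    for chrom, start, end, direction in cnvs:
        for gene, transcripts in gene_models.items():
            if (gene, direction) in benign:
                continue
            if gene_hit_by_cnv((chrom, start, end), transcripts):
                hits.add(gene)
    return hits

# Toy gene models (invented coordinates).
gene_models = {
    "GENE_A": {"t1": [("chr1", 100, 200)],
               "t2": [("chr1", 150, 250), ("chr1", 900, 950)]},
    "GENE_B": {"t1": [("chr1", 100, 200)],
               "t2": [("chr1", 900, 950)]},   # t2 escapes the CNV
    "GENE_C": {"t1": [("chr1", 120, 180)]},   # benign when deleted
}
affected = candidate_genes([("chr1", 50, 300, "loss")],
                           gene_models,
                           benign={("GENE_C", "loss")})
```

In this toy case only `GENE_A` is assigned: `GENE_B` has a transcript whose coding exons lie entirely outside the CNV, and `GENE_C` is filtered as benign in the loss direction.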

Mouse Model Phenotypes

The descriptions of mouse phenotypes resulting from the disruption of mouse orthologs of human genes were obtained from the MGI online resource [Eppig et al., 2007]. The mouse model phenotypes are described through the MPO, which is organized into 29 overarching categories with multiple levels of finer phenotypic terms beneath them [Smith and Eppig, 2009]. Each gene is annotated with the most specific observed phenotype term reported within a cited publication describing the mouse model along with the overarching MPO category under which that term is described. For our analysis, we assigned all intermediate phenotype terms between the overarching and finest mouse phenotype to a gene [Shaikh et al., 2011]. In order to reduce uninformative results and to exclude testing phenotypes with insufficient power to reach significance, we only considered those finer mouse MPO phenotype terms associated with at least 1% of all genes annotated within the overarching MPO phenotype category [Webber et al., 2009].

Using 1:1 gene orthology relationships between mouse and human genes defined by the MGI, we mapped 5,283 MPO phenotype terms to 5,671 human ENSEMBL genes. For each CNV set in the DECIPHER, ECARUCA, and combined sets of DECIPHER and ECARUCA, we tested for an enrichment of genes whose disrupted mouse orthologs were associated with specific mouse phenotypes (Fig. 2). In previous studies that considered patient cohorts presenting with a narrower range of symptoms, we employed a two-step procedure: First, we examined each of the 29 overarching MPO categories for a broad-scale enrichment, and then we examined each finer phenotypic term within any overarching category found to be significantly enriched [Webber, 2011; Webber et al., 2009]. However, as more than 90% of patients are annotated with multiple LMD symptom terms in this study, the vast majority of these patients' CNVs will be investigated for a putative causative role in multiple distinct symptoms, as each CNV may be assigned to multiple symptom–CNV sets. In order to focus the investigation of each symptom–CNV set toward the symptom of interest, we tested those finer mouse phenotypic terms within those overarching categories of mouse phenotypes relevant to the symptom represented by the particular symptom–CNV set (Supp. Table S1) [Shaikh et al., 2011].

Figure 2.

Examining human symptom–CNV sets for enrichments of genes associated with mouse phenotypes. 1: Formation of a set of CNVs associated with a human symptom. 2: Identification of human genes affected (duplicated or deleted) by the CNVs. 3: Identification of unique mouse orthologs of affected human genes. 4: Identification of mouse phenotype enrichments among gene sets. 5: Identification of mouse phenotype enrichments significantly above that expected by chance.

Given the absence of a suitable control, we compared the gene function content of these de novo CNV sets to the genomic background [Raychaudhuri et al., 2010], noting that our gene assignment procedure does not appear to incur any concerning tissue-specific gene bias [Webber, 2011]. A hypergeometric test was employed to test the null hypothesis that de novo CNVs identified in patients with a specific developmental symptom randomly sample all genes. As many mouse phenotypes were tested for enrichment within each human symptom–CNV gene set, a multiple-testing correction was applied to hold the false discovery rate (FDR) below 5% [Benjamini, 1995]. The application of this significance threshold is conservative, given that the FDR correction assumes that each test is independent, whereas in reality many phenotype terms within the MPO are directly related.
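The statistical test can be sketched with the standard library alone. The sketch assumes a one-sided (upper-tail) hypergeometric test and the Benjamini-Hochberg step-up procedure; the text names a hypergeometric test and an FDR correction but does not spell out these details, and the function names are invented for the example.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k): probability of seeing at least k phenotype-associated
    genes when n genes are drawn without replacement from a background
    of N genes, K of which carry the phenotype."""
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

def benjamini_hochberg(pvalues, fdr=0.05):
    """Return one boolean per test: True if significant at the given FDR
    under the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * fdr:
            cutoff_rank = rank  # largest rank passing its threshold
    significant = [False] * m
    for i in order[:cutoff_rank]:
        significant[i] = True
    return significant

# Toy usage: 2 of 2 sampled genes carry a phenotype found in 5 of 10 genes.
p = hypergeom_sf(2, N=10, K=5, n=2)
flags = benjamini_hochberg([0.001, 0.04, 0.03, 0.5])
```

In practice one p-value would be computed per mouse phenotype term tested against a symptom–CNV gene set, and the correction applied across the terms tested for that set.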


Results

We sought to objectively associate mouse model phenotypes with human disease symptoms. To accomplish this, we examined dispersed de novo CNVs observed in human patients that present a shared symptom for enrichments of genes whose mouse orthologs, when disrupted, result in the same mouse phenotype.

We obtained two large sets of de novo CNVs identified in patients with multiple developmental symptoms. The first set from DECIPHER comprises 636 de novo CNVs identified in 525 patients [Firth et al., 2009] (Table 1). The second set obtained from the ECARUCA database comprises 988 de novo CNVs identified in 958 patients (Table 1) [Feenstra et al., 2006]. The median size of the DECIPHER CNVs is 2.2 Mb (mean = 3.7 Mb, SD = 4.4 Mb), whereas the ECARUCA CNVs are significantly larger with a median size of 18.1 Mb (mean = 21.2 Mb, SD = 14 Mb). For both datasets, patient symptoms are described using the LMD, which defines 1,682 human symptoms arranged into hierarchical relationships with three levels of specificity (see Supp. Fig. S1 and Materials and Methods section) [Fryns and de Ravel, 2002]. In total, 685 different symptoms are presented by patients whose CNVs form the DECIPHER set, while 892 different symptoms are presented by patients whose CNVs form the ECARUCA set. Of these, 489 symptoms are observed in both sets and thus, in total, 1,088 of the 1,682 LMD symptoms (65%) are represented in at least one patient.

The vast majority of patients in both datasets present with multiple symptoms: DECIPHER, median = 5, range = 1–32; ECARUCA, median = 8, range = 2–102 (Table 1). Across all DECIPHER and ECARUCA patients' symptoms, 91% and 97%, respectively, are described at the most specific level of the LMD ontology (Supp. Fig. S1). In order to identify mouse model phenotypes associated with individual human symptoms, for each of the two datasets separately, we formed a nonexclusive set of de novo CNVs drawn from the subset of patients with a particular symptom, herein termed symptom–CNV sets (Fig. 1). As 97% of patients present with more than one symptom, almost all of the CNVs belong to more than one symptom–CNV set. Symptom–CNV sets were formed at each level of symptom specificity within the LMD hierarchy in order to exploit the power gained by merging more specific symptom–CNV sets. For any patient with a child or grandchild LMD term, we imputed the LMD terms above it. For the 489 symptoms observed among both the DECIPHER and ECARUCA patients, additional larger symptom–CNV sets were created from the union of patients presenting with the same symptom in each dataset, herein termed combined symptom–CNV sets. Considering that the direction of copy number change (gain/loss) of a gene could affect the underlying pathoetiology of a symptom, we further subdivided each of the symptom–CNV sets to form groups of gain and loss CNVs (see Discussion section). For each of the symptom–CNV sets, the protein-coding genes affected by the CNVs were identified using ENSEMBL and assigned using a procedure demonstrably less prone to known gene length biases (see Materials and Methods section) [Webber, 2011].
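Imputing the more general LMD terms for a patient amounts to walking up a single-parent hierarchy. The sketch below assumes a hypothetical `parent_of` map from each term to its one parent; the three-term chain reuses the example terms quoted for the LMD in the Materials and Methods section.

```python
def impute_lmd_terms(terms, parent_of):
    """Return the given terms plus every term above them.

    `parent_of` is an assumed child -> parent map; top-level terms
    are simply absent from it.
    """
    expanded = set()
    for term in terms:
        while term is not None:
            expanded.add(term)
            term = parent_of.get(term)  # None once the top level is reached
    return expanded

parent_of = {
    "mental retardation": "mental cognitive function, general abnormalities",
    "mental cognitive function, general abnormalities": "neurology",
}
patient_terms = impute_lmd_terms({"mental retardation"}, parent_of)
```

A patient annotated only with the most specific term thereby also contributes CNVs to the symptom–CNV sets of its parent and grandparent terms.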

Identifying Mouse Model Phenotypes Associated with Developmental Disorders

Mouse phenotype data resulting from the targeted disruption of 5,671 unique 1:1 human–mouse orthologs were obtained from the MGI resource [Eppig et al., 2007]. These genes' model phenotypes are annotated according to the MPO, which defines 5,283 phenotype terms within 29 overarching categories. As the vast majority of patients present with a broad range of symptoms, the CNVs within a given symptom–CNV set may each be associated with symptoms from several different LMD categories. Thus, in order to focus the analysis of a given symptom–CNV set toward the human symptom of interest, we considered only those mouse model phenotypes described under the phenotype categories deemed most clearly and directly relevant to the human symptom under investigation (Supp. Table S1). The number of mouse phenotypes tested for a symptom–CNV set ranged from 245 to 2,655, and thus we employed a highly conservative multiple-testing correction to keep the FDR to less than 5% [Benjamini, 1995].

Of the 685 DECIPHER symptom–CNV sets and 892 ECARUCA symptom–CNV sets tested, we identified 46 (7%) and 101 (11%), respectively, that were significantly enriched (FDR < 5%) in genes whose disrupted mouse orthologs result in particular mouse phenotypes (Tables 2 and 3; Supp. Tables S2 and S3). Of these, the symptom–CNV sets for two symptoms represented in both DECIPHER and ECARUCA, namely “mental retardation” and “syndactyly of toes,” are enriched in genes associated with the same mouse model phenotypes in both datasets: abnormal brain morphology and abnormal skeleton extremities morphology, respectively. For both DECIPHER and ECARUCA, many mouse phenotype enrichments are readily comparable to the human disorder under investigation, while others are less comparable and generate interesting and novel pathoetiological hypotheses (Fig. 3; see Discussion section). Among the 489 combined symptom–CNV sets, we observe mouse phenotype enrichments for 74 (15%) human symptoms, of which 41/74 (55%) were not observed when considering each dataset separately (Supp. Tables S4 and S5).

Figure 3.

Example enrichments of mouse phenotypes among CNVs observed in patients with a common symptom. Enrichments are shown as the percentage increase over that expected by chance. Those marked with an asterisk are significant (FDR < 5%). The enrichments shown in panels A, B, E, and F are those identified in DECIPHER symptom–CNV sets (Table 2; Supp. Table S2), while those shown in panels C and D were identified in ECARUCA symptom–CNV sets (Table 3; Supp. Table S3).
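The percentage increase over chance reported in the figure and tables can be computed as sketched below. This assumes the chance expectation is the hypergeometric mean (gene-set size multiplied by the background frequency of the phenotype); the caption does not state the formula, and the function name and numbers are illustrative only.

```python
def percent_enrichment(observed, set_size, annotated, background):
    """Percentage increase of the observed count over chance.

    observed:   genes in the symptom-CNV gene set with the phenotype
    set_size:   genes in the symptom-CNV gene set
    annotated:  background genes with the phenotype
    background: all background genes
    """
    expected = set_size * annotated / background
    return (observed / expected - 1.0) * 100.0

# Toy numbers: 12 observed where 100 * 500 / 10000 = 5 are expected.
enrichment = percent_enrichment(12, 100, 500, 10000)
```

Under this definition a value of 100 means twice as many phenotype-associated genes as expected by chance.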

Table 2. The Topmost Significant Mouse Phenotype Enrichment Observed Among Genes in Each DECIPHER Symptom–CNV Set
Human symptom | Mouse phenotypic enrichment | % Enriched | Patients hit | Gene count
Note: Enrichments are given as the percentage change over that expected by chance. The number of patients with at least one gene contributing a mouse phenotype enrichment is given as a fraction of the total number of observed human patients presenting with that symptom. The full listing of all associated mouse model phenotypes for DECIPHER symptom–CNV sets is given in Supp. Table S2.

BuildDecreased birth body size1829/1712
Thin or slender build, general abnormalitiesDecreased fetal size17911/2620
Low birth weightDecreased fetal size2139/2416
Short stature, general abnormalitiesIncreased lean body mass2599/728
Short stature, prenatal onsetAbnormal fetal growth/weight/body size13812/1921
Tall stature, proportionateAbnormal chest morphology4,2451/12
Prominent forehead/frontal bossingAbnormal soft palate1,2623/274
Broad base to noseMalocclusion2,1172/44
Large noseAbsent palatine shelf3,1722/53
Cupid bow shape of mouthAbnormal secondary palate development1,4912/24
Open mouth appearanceBranchial arch hypoplasia3,6812/73
Thin lower lipAbnormal tooth mineralization8,4281/12
Short philtrumMalocclusion1,2813/74
Malocclusion of teethSmall branchial arch2,5422/23
Nasal speechAbnormal prepulse inhibition2,4681/55
Speech defect/dysarthriaAbnormal thermal nociception1,0593/56
Loose skin in neckPremature hair loss6,9891/12
Scapulae, general abnormalitiesAbnormal cartilage morphology7722/35
Broad handsShort femur1,3772/44
CamptodactylyAbnormal metacarpal bone morphology8252/86
ClinodactylyDecreased long bone epiphyseal plate size5605/286
Thin brittle nailsAbnormal limb/digit/tail morphology4642/26
Syndactyly of two to three toesAbnormal metacarpal bone morphology8231/95
Syndactyly of toes (not two to three)Abnormal metacarpal bone morphology1,7026/76
Hematology/immunologyAbnormal T-cell activation1809/1320
Hematology/immunology, general abnormalitiesAbnormal T-cell activation1809/1320
NeurologyAbnormal brain morphology15275/384401
Mental cognitive function, general abnormalitiesAbnormal brain morphology15256/352383
Mental retardation/developmental delayAbnormal brain morphology15253/349383
HyperactivityAbnormal conditioning behavior5895/249
Psychotic behaviorAbnormal prepulse inhibition3,1821/15
Seizures, general abnormalitiesIncreased sensory neuron number25216/6713
Complex partial seizuresAbnormal circadian rhythm4,0702/23
Febrile convulsionAbnormal ammon gyrus morphology2,8641/13
Spinal myoclonusAbnormal prepulse inhibition1,3771/14
Paroxysmal disorders general abnormalitiesAbnormal prepulse inhibition1,2632/46
Intermittent tremor at restAbnormal prepulse inhibition1,3771/14
BrachycephalyAbnormal Purkinje cell dendrite morphology7475/136
Dolichocephaly/scaphocephalyAbnormal cued conditioning behavior1,1562/36
Ataxia, general abnormalitiesAbnormal limbic system morphology2646/812
Pyramidal signs, general abnormalitiesAbnormal neurotransmitter secretion7303/76
Spasticity/brisk reflexes/BabinskiAbnormal neurotransmitter secretion7403/76
Neuroradiology general abnormalitiesAbnormal cerebellum development2909/1311
Basal ganglia lesionAbnormal prepulse inhibition1,3771/14
Thalamic lesionAbnormal prepulse inhibition1,3771/14
Nevi or lentiginesThick dermal layer8,2401/12

Table 3. The Topmost Significant Mouse Phenotype Enrichment Observed Among Genes in Each ECARUCA Symptom–CNV Set
Human symptom | Mouse phenotypic enrichment | % Enriched | Patients hit | Gene count
Note: Enrichments are given as the percentage change over that expected by chance. The number of patients with at least one gene contributing a mouse phenotype enrichment is given as a fraction of the total number of observed human patients presenting with that symptom. The full listing of all associated mouse model phenotypes for ECARUCA symptom–CNV sets is given in Supp. Table S3.

BrachycephalyAbnormal malleus morphology9032/11024
Dolichocephaly/scaphocephalyAbnormal malleus morphology16617/6017
MicrocephalySmall branchial arch5882/21837
Plagiocephaly/asymmetrical skullAbnormal malleus morphology19714/3523
Cerebral atrophy/heterotopiasAbnormal basisphenoid bone morphology14817/3618
Cerebellar abnormality/hypoplasia (structural)Abnormal nasal bone morphology25211/1514
Hydroceph/large ventricles non-specificAbnormal external auditory canal13819/10013
Flat occiputAbnormal malleus morphology12915/5318
Delayed closure of/large fontanelleAbnormal maxilla morphology7253/7460
Ridged cranial suturesAbnormal hyoid bone morphology5503/47
Wide cranial suturesAnencephaly3859/207
Prominent forehead/frontal bossingAbnormal palate morphology2893/151125
Hyperplastic supraorbital ridgesAbnormal basioccipital bone morphology7972/25
Metopic ridgeAbnormal occipital bone morphology7982/27
Narrow forehead/temporal narrowingSmall branchial arch10317/6022
Sloping foreheadAbnormal occipital bone morphology8402/27
Hypoplastic supraorbital ridgesBranchial arch hypoplasia4585/147
Dysplastic earsAbnormal malleus morphology6079/22431
Posteriorly rotated earsAbnormal endolymphatic duct morphology11534/11619
Preauricular pits/fistulasAbnormal organ of Corti-supporting cell morphology16825/3618
Prominent earsAbnormal malleal manubrium morphology14721/8111
Simple earsAbnormal malleus morphology21922/3518
Prominent antihelixAbnormal malleus morphology14915/3818
Corneal abnormalitiesAbnormal cornea morphology3523/49
Coloboma of irisAbnormal eye development7529/3540
Iris atrophy/dysplasiaAnophthalmia5564/46
Absent eyelidsAbnormal basioccipital bone morphology9742/35
BlepharophimosisAbnormal hard palate7828/7928
Palpebral fissures slant downAbnormal craniofacial development40121/150139
Epicanthic foldsMandible hypoplasia5780/26721
Broad base to noseAbnormal malleus morphology13119/4518
Large noseAbnormal head morphology4824/2873
Flat noseSmall nasal bone4217/217
Small/short noseAbnormal hard palate6762/15335
Depressed/flat nasal bridgeAbnormal craniofacial development2472/222173
High/prominent nasal bridgeAbnormal maxillary shelf12721/8216
Wide nasal bridgeAbnormal hard palate6576/17033
Anteverted naresAbnormal craniofacial development26103/143136
Flared naresAbnormal malleus morphology6503/46
Asymmetric faceAbnormal middle ear ossicle morphology15827/4123
Midface hypoplasia (excluding flat malar)Abnormal first branchial arch morphology12948/8521
Flat malar regionAbnormal mandibular angle morphology8918/175
Small mandible/micrognathiaAbnormal craniofacial development18316/395202
Prominent maxillaAbnormal second branchial arch morphology4415/78
Downturned corners of the mouthAbnormal molar morphology7354/14827
Long philtrumAbnormal malleal manubrium morphology9824/13812
Short philtrumAbnormal maxilla morphology6346/7143
Wide philtrumAbnormal hyoid bone morphology2344/1011
Cleft upper lip (nonmidline)Abnormal palatal shelf fusion at midline21931/4212
Prominent upper lipAbnormal malleus morphology15410/3316
High palateSmall branchial arch4092/22341
Prominent lateral palatine ridgesAbnormal third branchial arch morphology7881/15
Delayed tooth eruption/developmentAbnormal skull morphology6314/1471
Irregular or crowded teethAbnormal craniofacial development4830/3780
Neonatal teethAbnormal occipital bone morphology4753/47
Abnormally shaped teethAbnormal nasal capsule morphology5576/137
LordosisAbnormal calvaria morphology2622/212
Meningocele/meningomyeloceleAbnormal cranial base morphology4594/48
ScoliosisAbnormal metacarpal bone morphology11512/6319
Sacral dimple/sinusAbnormal rib morphology4555/6583
Vertebrae, general abnormalitiesAbnormal malleus morphology20710/2118
Asymmetric thoraxDecreased birth body size1388/1216
Broad/barrel thoraxAbnormal chest morphology4532/26
Pulmonary incompetenceAbnormal cardiovascular development4751/16
Cardiac situs inversus/dextrocardiaAbnormal heart ventricular pressure1,4291/14
Tricuspid incompetenceAbnormal tricuspid valve morphology3746/118
Respiratory difficulties, generalAbnormal olfactory placode morphology6073/45
Abdomen, general abnormalitiesAbnormal small intestine morphology4522/27
Small bowel atresia/absence/obstr.Abnormal small intestine morphology4102/29
Feeding problems in infantsAbnormal palate development6354/9832
Inguinal herniaCleft palate4246/6782
Megacolon or Hirschsprung's syndromeDecreased pancreatic beta cell number7202/65
Abnormal liver (including function)Abnormal pancreatic alpha cell morphology1,0963/45
Stomach tumorsAbnormal intestinal goblet cells1,4751/14
Pelvis, general abnormalitiesAbnormal thoracic cage2612/220
Pubic ossification defectAbnormal occipital bone morphology1,3111/15
Fused labiaAbnormal prostate gland morphology6032/26
Nephritis or nephropathyRenal fibrosis6561/26
Renal tumors (including Wilms)Renal fibrosis62719/196
Webbing at elbowAbnormal long bone morphology8911/15
Fingers, general abnormalitiesAbnormal forelimb morphology5791/17
Adducted thumbsDecreased long bone epiphyseal plate size2609/1610
Proximal placement of thumbAbsent radius26210/538
Absent or hypoplastic patellaAbnormal long bone morphology8911/15
Genu valgumAbnormal forelimb morphology2153/315
Genu varumAbnormal caudal vertebrae morphology6692/26
Hypoplastic or absent tibiaAbnormal long bone morphology3962/27
Ankle, general abnormalitiesAbnormal long bone morphology8911/15
Short halluxAbnormal skeleton extremities morphology1239/929
Syndactyly of two to three toesShort radius10622/7020
Syndactyly of toes (not two to three)Abnormal skeleton extremities morphology5824/2653
Bleeding diathesesDecreased spleen weight2,3061/14
Recurrent infectionsAbnormal T-cell apoptosis5659/10651
Hemiplegia/tetraplegiaAbnormal prepulse inhibition1,0581/25
HypotoniaAbnormal brain morphology12291/298601
LethargyAbnormal sensory neuron morphology2692/213
Mental retardationAbnormal forebrain morphology878/711560
Seizures/abnormal electroencephalogramAbnormal brain morphology1572/175507
Cartilagineous exostosesAbnormal skeleton development16013/1335
OsteoporosisCervical vertebral transformation3853/69
Stippled or fragmented epiphysesAbnormal calvaria morphology5042/210

Although it is plausible that smaller symptom-causing CNVs may overlap fewer non-disease-associated genes, thereby increasing functional enrichments, we did not find any benefit from considering only smaller CNVs. For example, considering only DECIPHER de novo CNVs smaller than 2.5 Mb yielded enrichments for 35/516 (6.7%) human symptoms, while considering only ECARUCA CNVs smaller than 25 Mb yielded enrichments for only 61/831 (7.3%). Indeed, larger CNVs are observed to be more enriched in patients with developmental disorders than smaller CNVs [Vermeesch et al., 2011].

We then examined how the growth of these CNV databases affects the power of FEAs. We performed a resampling analysis of the DECIPHER patients using 25, 50, and 75% of the total patient sample, resampling 100 times at each fraction. As the fraction of patients sampled increased, we were able to identify enrichments for an increasing number of human symptoms: 18 symptoms (2.6%) in 25% of patients, 28 (4.1%) in 50% of patients, and 36 (5.3%) in 75% of patients, as compared to 46 (6.7%) when employing all DECIPHER patients (Supp. Fig. S2). These results demonstrate the increased power made available to medical genomics by pooled data.
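The subsampling procedure described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: it assumes patients are drawn without replacement, and `count_enriched` is a hypothetical stand-in for a full FEA run that returns the number of symptoms with a significant enrichment for a given set of patients. The toy callback and toy patient data are invented for the example.

```python
import random
from statistics import mean


def resample_enrichment_counts(patient_ids, count_enriched,
                               fractions=(0.25, 0.5, 0.75, 1.0),
                               n_iter=100, seed=0):
    """Mean number of enriched symptoms found at each sampling fraction."""
    rng = random.Random(seed)
    patient_ids = list(patient_ids)
    results = {}
    for frac in fractions:
        k = round(frac * len(patient_ids))
        # Assumption: each resample draws k patients without replacement.
        counts = [count_enriched(rng.sample(patient_ids, k))
                  for _ in range(n_iter)]
        results[frac] = mean(counts)
    return results


# Hypothetical toy data: patient -> symptoms.
toy_symptoms = {
    "p1": {"hypotonia", "seizures"},
    "p2": {"hypotonia"},
    "p3": {"seizures", "short hallux"},
    "p4": {"short hallux"},
}


def toy_count_enriched(sample, min_patients=2):
    """Stand-in for an FEA: a symptom counts once >= min_patients present it."""
    tally = {}
    for pid in sample:
        for symptom in toy_symptoms[pid]:
            tally[symptom] = tally.get(symptom, 0) + 1
    return sum(1 for n in tally.values() if n >= min_patients)


toy_result = resample_enrichment_counts(toy_symptoms, toy_count_enriched)
```

Because the toy callback can only gain symptoms as patients are added, no subsample can exceed the count obtained from the full patient set, which mirrors the trend reported above.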

Mouse Model Phenotype Associations Identify Candidate Genes for Human Symptoms

The genes that contribute to the significant mouse model phenotype associations identified above are singled out as candidate genes whose copy number change may causally underlie the patient's phenotype. The 46 symptom–CNV sets with significant mouse phenotype enrichments from DECIPHER identify 595 candidate genes, while the 101 significantly enriched symptom–CNV sets from ECARUCA identify 1,896 candidate genes (Supp. Table S6). Of these genes, 463 candidate genes were identified by both the DECIPHER and ECARUCA CNV sets, and of these, 399 (86%) are drawn from the human symptoms “mental retardation” and “syndactyly of toes” (the two human symptoms with the same mouse phenotype enrichments across the two datasets). In total, these candidate genes provide a causal hypothesis for one or more symptoms in 54% of DECIPHER patients and 96% of ECARUCA patients. Of the 595 DECIPHER candidate genes and the 1,896 ECARUCA candidate genes, 199 (33%) and 671 (35%), respectively, are described as associated with human disease within Online Mendelian Inheritance in Man (OMIM) [Hamosh et al., 2005; McKusick, 1998] (Supp. Table S6). Among the 463 candidate genes identified separately by both DECIPHER and ECARUCA symptom–CNV sets, the proportion already associated with disease remains constant at 164 genes (35%). We compared the mental retardation candidate genes identified in this study with those identified in an earlier study that employed a similar approach [Webber et al., 2009]. After discounting those CNVs from DECIPHER used in both studies, we replicated the association with mental retardation for 30/55 (55%) of these candidate genes in one or both of the DECIPHER and ECARUCA datasets.

Although the combined symptom–CNV sets covered fewer symptoms, the increased power nonetheless led to the identification of 2,086 candidate genes whose copy number change contributes to the disorders of 86% of the 1,433 patients analyzed (Supp. Table S6). Among the combined symptom–CNV candidate genes, 167 (8%) are not identified by either the DECIPHER or the ECARUCA symptom–CNV sets alone, and thus are novel, and 718 (34%) have been previously associated with human disease within OMIM. While 101/167 (60%) of the novel candidate genes contribute to the 41 novelly associated human symptoms (see above), 66 genes (40%) are identified from associations previously detected in either the DECIPHER or ECARUCA sets alone, but which have been extended across additional CNVs in the combined symptom–CNV sets.

The patients considered here present, on average, seven symptoms each (up to a maximum of 102), and 1,156 (78%) patients possess multiple CNV candidate genes. Among patients with multiple candidate genes, we therefore asked whether these candidate genes were all associated with a single symptom or whether different candidate genes were associated with distinct symptoms. Strikingly, of the 1,258 patients for whom we can identify a candidate gene, 1,147 (91%) have multiple candidate genes associated with a single symptom, showing that this is a general feature across all these patients' CNVs and not simply a consequence of the significantly larger ECARUCA CNVs (Table 1). Indeed, the median number of candidate genes per patient per symptom is 2. For the converse scenario, we identified 18 DECIPHER and 25 ECARUCA patients possessing multiple candidate genes, where different genes were associated with distinct symptoms (Fig. 4 and Supp. Table S7). These patients represented only 4% of those patients with multiple candidate genes, while the combined DECIPHER and ECARUCA analyses identified only a further four patients.

Figure 4.

The molecular dissection of individual CNVs by human symptoms. For a subset of patients, we identified distinct non-overlapping associations among the affected genes that suggest an additive pathoetiology. Two examples from a total of 47 identified are presented. For each patient, the candidate gene(s) associated with each symptom are shown.

Finally, we examined the diversity of symptoms that each candidate gene's copy number change contributes to. This is inevitably dependent on the symptomatic detail within the LMD definitions used to describe these patients' clinical presentations (Supp. Fig. S1). Nonetheless, we found that a large proportion of the candidate genes contribute to multiple symptoms, whatever level of symptom specificity within the LMD hierarchy we considered. Of 2,086 candidate genes, 1,721 (83%) each contribute to multiple symptoms distinct at the most specific level of the LMD, while 1,407 (67%) each contribute to multiple symptoms distinct at the intermediate level, and 934 (45%) each contribute to multiple symptoms distinct at the most general level.
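The counts at each level of symptom specificity can be illustrated with a short sketch. It assumes each symptom can be mapped to its ancestors as a three-part path (most general, intermediate, most specific); the hierarchy placements, gene names, and the function `count_multi_symptom_genes` are hypothetical and are not taken from the LMD itself.

```python
def count_multi_symptom_genes(gene_to_symptoms, symptom_path, level):
    """Count genes whose symptoms map to more than one term at `level`.

    level 0 = most general, 1 = intermediate, 2 = most specific.
    """
    n = 0
    for symptoms in gene_to_symptoms.values():
        terms = {symptom_path[s][level] for s in symptoms}
        if len(terms) > 1:
            n += 1
    return n


# Hypothetical hierarchy: symptom -> (general, intermediate, specific).
symptom_path = {
    "Genu valgum": ("Limbs", "Knee", "Genu valgum"),
    "Genu varum": ("Limbs", "Knee", "Genu varum"),
    "Short hallux": ("Limbs", "Toes", "Short hallux"),
    "Hypotonia": ("Neurology", "Tone", "Hypotonia"),
}

# Hypothetical candidate genes and the symptoms they contribute to.
gene_to_symptoms = {
    "GENE_A": {"Genu valgum", "Genu varum"},    # distinct at specific level only
    "GENE_B": {"Genu valgum", "Short hallux"},  # distinct down to intermediate level
    "GENE_C": {"Short hallux", "Hypotonia"},    # distinct at every level
    "GENE_D": {"Hypotonia"},                    # single symptom
}

counts_by_level = [
    count_multi_symptom_genes(gene_to_symptoms, symptom_path, level)
    for level in (0, 1, 2)
]
```

As in the results above, the count can only fall or stay the same when moving from the most specific to the most general level, because symptoms that differ in detail may share a broader parent term.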

Recurrently Copy Number Variable Regions Observed in Patients with a Shared Symptom Identify Additional Mouse Phenotype Enrichments

The DECIPHER and ECARUCA databases often help clinicians robustly associate a recurrently affected locus observed in patients with those patients' shared symptoms [Feenstra et al., 2006; Firth et al., 2009]. Accordingly, we re-examined only those regions observed to be copy number variable in multiple patients sharing a symptom in the combined DECIPHER and ECARUCA datasets for enrichments of genes associated with particular mouse phenotypes. By investigating only regions that occur more than once in patients presenting with a particular symptom, we aimed to reduce any “noise” contributed by the large number of genes within our datasets that may not be causally related to the human symptom.

Among 385 combined symptom–CNV recurrent region sets, 45 (12%) were found to harbor genes that were significantly associated with a mouse model phenotype, among which 28 (62%) human symptoms had not been associated in the previous analyses (Supp. Table S8). For 16 of the 17 human symptoms that had mouse model associations in the previous analyses, the enrichment was higher in the recurrent region set (Supp. Tables S4 and S8). Although these findings show that recurrency is a powerful aid in identifying disease associations, the proportion of patients considered in each analysis for whom we can identify a candidate gene is higher (80%) in our original analysis than the recurrent region analysis (58%), illustrating that dispersed loci provide significant, and also complementary, additional power.
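One way to extract recurrently affected regions is a sweep over CNV boundaries, sketched below. How recurrent regions were delineated is not specified in this section, so the sketch makes its own assumptions: CNVs are half-open intervals `[start, end)`, a region is recurrent when CNVs from at least two distinct patients sharing the symptom cover it, and the function name `recurrent_regions` and the coordinates are hypothetical.

```python
from collections import defaultdict
from itertools import groupby


def recurrent_regions(cnvs, min_patients=2):
    """Return (chrom, start, end) regions covered by >= min_patients patients.

    cnvs: iterable of (patient_id, chrom, start, end), half-open intervals,
    all taken from patients who share one symptom.
    """
    events = defaultdict(list)
    for patient, chrom, start, end in cnvs:
        events[chrom].append((start, 1, patient))
        events[chrom].append((end, -1, patient))

    regions = []
    for chrom in sorted(events):
        active = defaultdict(int)  # patient -> number of currently open CNVs
        region_start = None
        ordered = sorted(events[chrom], key=lambda e: e[0])
        for pos, group in groupby(ordered, key=lambda e: e[0]):
            # Apply every boundary at this position before testing coverage,
            # so CNVs that merely touch are not counted as overlapping.
            for _, delta, patient in group:
                active[patient] += delta
                if active[patient] == 0:
                    del active[patient]
            covered = len(active) >= min_patients
            if covered and region_start is None:
                region_start = pos
            elif not covered and region_start is not None:
                regions.append((chrom, region_start, pos))
                region_start = None
    return regions


toy_cnvs = [
    ("p1", "chr1", 0, 10),
    ("p2", "chr1", 5, 15),
    ("p3", "chr1", 10, 20),
    ("p1", "chr2", 0, 5),   # two CNVs in the same patient do not
    ("p1", "chr2", 3, 8),   # make a region recurrent
]
toy_regions = recurrent_regions(toy_cnvs)
```

Only the genes lying within the returned regions would then be passed to the enrichment analysis for that symptom.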

Discussion

Our results show that de novo CNVs identified in patients with developmental disorders exhibit significant biases in their protein-coding gene products. This has enabled us to associate mouse model phenotypes with a number of human symptoms observed in patients with developmental disorders (Fig. 3; Tables 2 and 3; Supp. Tables S2 and S3). The mouse model phenotype enrichments among genes within a symptom–CNV set suggest a causal role in those patients' symptoms for both the genes that contribute to these enrichments and the CNVs that harbor them.

Many of the mouse model phenotype enrichments are directly comparable to the human symptom under investigation, particularly for human anatomical malformations, providing confidence in our findings beyond the statistical significance (Figs. 3A–3D; Tables 2 and 3; Supp. Tables S2–S4 and S8). For example, CNVs identified in patients with the symptom “low birth weight” are enriched in genes associated with decreased fetal size phenotypes in mouse models (Fig. 3A), while CNVs identified in patients with malocclusion are enriched in genes associated with malocclusion in mouse (Fig. 3B). Other mouse model phenotype enrichments are less directly comparable to the human symptom under investigation, and may be novelly identifying disrupted biological processes/systems underlying those patients' symptoms. For example, CNVs observed in patients presenting with “complex partial seizures” are enriched in genes associated with an abnormal circadian rhythm phenotype in mouse (Fig. 3E), which fits well with the observation that patients with complex partial seizures frequently suffer seizures at a set time during their sleep–wake cycle [Yalyn et al., 2006]. Similarly, CNVs from patients with psychotic behavior are enriched in genes associated with a decreased prepulse inhibition phenotype in mouse (Fig. 3F), and it has been observed that patients prone to developing psychosis exhibit decreased prepulse inhibition [Kumari et al., 2008].

Although the similarities between the human symptoms and the mouse phenotypes are clear, direct comparison of a human patient with the relevant mouse model is complicated by differences between the mouse and human genotypes. Whereas the mouse phenotypes result from homozygous gene knockouts, the copy number variable genes observed in the human patients are, as far as we know, either heterozygous deletions or gene duplications. Fundamentally, if there were no correlation between the phenotypes resulting from genes in differing abnormal copy number, then we would not expect to observe any significant associations except between mouse models and human patients with the same copy number change. The numerous and readily comparable associations we find suggest that multiple variations of these genes' copy number are likely to affect the same biological processes. Indeed, it has been observed that many microdeletion syndromes have reciprocal microduplication syndromes at the same loci that affect the same organ system, for example, Smith–Magenis syndrome (deletion) and Potocki–Lupski syndrome (duplication) [Elsea and Girirajan, 2008; Yatsenko et al., 2005], and the mirrored body mass index phenotypes associated with gene dosage at the 16p11.2 locus [Jacquemont et al., 2011]. Disappointingly, even when a heterozygous null mouse model has been made, the phenotype(s) resulting from the heterozygous disruption are often not recorded, despite their clear medical interest. Nonetheless, some information is available for 1,226 (23%) of the mouse models considered here. Reassuringly, the candidate genes identified here are highly significantly enriched for genes whose mouse ortholog heterozygous disruptions are annotated as haploinsufficient (+38% enrichment; P < 10^-16).
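How an enrichment and its significance of the kind quoted above might be computed can be shown with a small worked example. This is a sketch under stated assumptions, not the study's procedure: it takes enrichment to be the percentage excess of observed over expected annotated genes and uses a one-sided hypergeometric test, and the function names and the numbers in the example are hypothetical.

```python
from math import comb


def percent_enrichment(k, n, K, N):
    """Percentage excess of observed (k) over expected annotated genes.

    N: genes in the background, K: background genes with the annotation,
    n: genes in the candidate set, k: candidate genes with the annotation.
    """
    expected = n * K / N
    return 100.0 * (k / expected - 1.0)


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n), computed exactly."""
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, min(n, K) + 1))
    return favourable / total


# Toy numbers: 20 of 100 background genes carry the annotation,
# and 4 of 10 candidate genes do (2 would be expected by chance).
toy_enrichment = percent_enrichment(4, 10, 20, 100)
toy_p_value = hypergeom_upper_tail(4, 10, 20, 100)
```

In this toy case the candidate set holds twice the expected number of annotated genes, so the enrichment is +100%; the corresponding tail probability shrinks as either the excess or the size of the gene sets grows.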

Of the 46 human symptom–CNV sets with significant mouse phenotype enrichments within DECIPHER, 10 are obtained through CNVs observed in a single patient, while within ECARUCA, 17 symptom–CNV sets have enrichments from CNV(s) from a single patient (Supp. Tables S2–S4). An enrichment from a single patient undermines the generalizability of these associations. However, these enrichments are only detectable because these particular patients' CNVs possess multiple genes that are all associated with a reasonably specific mouse model phenotype (median number of candidate genes = 3.5) and thus remain of clinical interest.

In any FEA exploiting a nonuniformly sampled source of functional annotations, a concern is whether an enrichment is found simply because the genes considered have received particular experimental attention. When generating a knockout mouse model, the gene under investigation is usually chosen due to a hypothesis that it will be biologically interesting. Thus, it could be argued that through our analysis of the mouse phenotypes, we are only examining genes within our patients already thought to be involved in human disease, and therefore not identifying novel disease genes. However, when we examine the annotations of these candidate genes in OMIM, we find that only 35% have been previously associated with human disease, and thus this appears not to be the case in this study (Supp. Table S6).

The mouse model phenotype enrichments identify candidate genes for at least one symptom observed in 54 and 96% of DECIPHER and ECARUCA patients, respectively. As more knockout mouse phenotypes are identified by the MGI database, we expect this candidate gene list to increase. Of the 2,086 candidate genes identified, only 17% contribute to a single human symptom, whereas 83% contribute to the mouse phenotype enrichments observed in more than one human symptom–CNV set (Supp. Table S6). This suggests that for many patients, their overall symptomatic presentation is caused by pleiotropic effects of copy number variable gene(s). As the median number of candidate genes per symptom per patient is 2, individual human symptoms may also be the result of the combinatorial effects of disrupting more than one gene. Over half of the candidate genes identified are for “mental retardation” (Supp. Table S6), the most common symptom within both DECIPHER and ECARUCA, presented by more than 90% of patients. Indeed, mental retardation is one of only two human symptoms whose mouse model phenotypic associations were replicated between these two large datasets, although we have only looked for replicated associations rather than seeking to validate associations through nominal significance (Supp. Tables S3 and S4). The limited overlap between the two sets in their other significant associations may reflect variations in the underlying pathoetiology, the substantial variation in human clinical and mouse model annotations, and variation in structural variant identification methodology represented within these two large databases. However, the increases in number, strength, and/or specificity of enrichments obtained either by combining DECIPHER and ECARUCA patients into a single set or by considering only those recurrently affected regions suggest that many of the substantial sources of noise can be overcome as the number of patients increases.

Conclusions

Our large-scale computational analyses of de novo CNVs held by two of the largest clinical structural variant databases have objectively discovered hundreds of human symptom/mouse model phenotype associations along with the identification of over 2,000 genes that contribute to them. These associations and candidate genes generate hundreds of causal hypotheses of relevance to various human disorders that we present here as a clinical resource. We see no reason why this methodology cannot be applied to other large CNV datasets [Cooper et al., 2007; Kaminsky et al., 2011] and the results used to improve predictive CNV pathogenicity scoring tools [Hehir-Kwa et al., 2010]. Most importantly, our findings illustrate the significance of centralized clinical data collections, consistently annotated and generally accessible, in facilitating large-scale genomic approaches to understand human diseases.

Acknowledgments

We are very grateful to the DECIPHER team, especially Nigel Carter, Helen Firth, and Manuel Corpas, and ECARUCA, especially Jayne Hehir-Kwa, for their help in accessing data. We are very grateful to all those patients and their families who contributed to DECIPHER and ECARUCA along with all the clinicians and genetic counselors who facilitated and annotated these data.