Properties of human genes guided by their enrichment in rare and common variants

Abstract
We analyzed 563,099 common (minor allele frequency, MAF ≥ 0.01) and rare (MAF < 0.01) genetic variants annotated in ExAC and UniProt and 26,884 disease‐causing variants from ClinVar and UniProt occurring in the coding region of 17,975 human protein‐coding genes. Three novel sets of genes were identified: those enriched in rare variants (n = 32 genes), in common variants (n = 282 genes), and in disease‐causing variants (n = 800 genes). In terms of biological and network properties, genes enriched in rare variants are far more similar to genes enriched in disease‐causing variants than to genes enriched in common variants. However, in half of the genes enriched in rare variants (AOC2, MAMDC4, ANKHD1, CDC42BPB, SPAG5, TRRAP, TANC2, IQCH, USP54, SRRM2, DOPEY2, and PITPNM1), no disease‐causing variants have been identified in major, publicly available databases. Thus, genetic variants in these genes are strong candidates for disease, and their identification as part of sequencing studies should prompt further in vitro analyses.


Statistics
The χ² test was used to compare observed and expected frequencies for categorical variables. Medians were compared between two categories using the Mann-Whitney-Wilcoxon test; for comparisons between three categories, the Kruskal-Wallis rank sum test was used to calculate P values. Genes in which disease-causing variants occur more often than expected (genes enriched in disease-causing variants) were identified using the hypergeometric test on the 17,975 genes in which at least one variant, deleterious or non-deleterious, was present. Each gene was assessed against all others. The resulting 17,975 P values were corrected using the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995) (total number of tests = 17,975). Genes in which rare or common variants occur more often than expected (genes enriched in rare or common variants) were identified using the hypergeometric test on the 17,902 genes in which at least one variant, rare or common, was present. Each gene was assessed against all others. The resulting 17,902 P values were corrected using the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995) (total number of tests = 17,902). Results were considered significant if the corrected two-sided P value was <0.05.

Genes enriched in disease-causing variants
Number of genes with at least one disease variant: 2,631
Number of genes with at least one non-disease variant: 17,902
Number of genes with at least one disease or non-disease variant: 17,975
Total number of calculated and corrected p-values: 17,975

Genes enriched in rare or common variants
Number of genes with at least one rare variant: 17,540
Number of genes with at least one common variant: 15,391
Number of genes with at least one rare or common variant: 17,902
Total number of calculated and corrected p-values: 17,902
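
The per-gene enrichment procedure described under Statistics (a hypergeometric test for each gene against all others, followed by Benjamini-Hochberg correction) can be illustrated with a minimal sketch. This is not the authors' code. The function names, the gene labels and the variant counts are hypothetical, and the sketch computes a one-sided upper-tail probability as a simplifying assumption, whereas the text above reports two-sided P values.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """Upper-tail probability P(X >= k) for a hypergeometric variable.

    M: total variants across all genes; n: variants of the class of
    interest (e.g. disease-causing) across all genes; N: variants in
    the gene being tested; k: class-of-interest variants in that gene.
    """
    k = max(k, 0)
    upper = min(n, N)
    tail = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return tail / comb(M, N)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def enriched_genes(counts, alpha=0.05):
    """counts maps gene -> (class-of-interest variants, all variants).

    Each gene is tested against the pooled counts of all genes
    (assumption: one-sided test for over-representation only).
    """
    M = sum(total for _, total in counts.values())
    n = sum(hits for hits, _ in counts.values())
    genes = list(counts)
    pvals = [hypergeom_sf(counts[g][0], M, n, counts[g][1]) for g in genes]
    adjusted = benjamini_hochberg(pvals)
    return {g: q for g, q in zip(genes, adjusted) if q < alpha}


# Hypothetical counts, for illustration only (not data from the study).
example_counts = {
    "GENE_A": (9, 10),   # 9 of 10 variants are in the class of interest
    "GENE_B": (1, 40),
    "GENE_C": (0, 50),
}
significant = enriched_genes(example_counts)
```

With these made-up counts only GENE_A passes the corrected threshold of 0.05. Exact integer arithmetic via `math.comb` keeps the sketch dependency-free; for the variant counts analyzed here a library implementation of the hypergeometric distribution would likely be more practical.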

SUPPLEMENTARY RESULTS, FIGURES AND TABLES

GO terms and cellular pathways in three gene datasets
To obtain a function-driven understanding of the similarities and differences between the genes in the rare-EVset and the common-EVset, we mapped them to cellular pathways. Genes enriched in rare variants were more likely (p<0.01) to be involved in "signal transduction pathways", similarly to genes enriched in disease-causing variants ("signal transduction pathways" and "metabolism"), whereas genes enriched in common variants were annotated as involved in "immune system pathway" (p<0.01). We also categorized each gene in the three sets using the Gene Ontology (GO) classification (Gene Ontology Consortium, 2015). When GO terms were examined, genes in the disease-EVset and rare-EVset were again significantly (p<0.05) more likely than genes in the common-EVset to be involved in core biological processes (namely "metabolic process" and "biological regulation" for genes in the disease-EVset, and "cellular process", "biogenesis" and "catalytic activity" for genes in the rare-EVset). Nevertheless, genes in the common-EVset were more likely than genes in the disease-EVset to be involved in "cellular components", "biological adhesions" and "developmental and cellular processes".

Supp. Figure S2 CADD C-scores for missense and nonsense variants in 12 genes enriched in rare variants.
The violin plots show the median C-scores for A) missense and B) nonsense (stop-gained) variants. *, de novo heterozygous variant, clinical significance unknown; **, de novo variant, clinical significance unknown; n.a., not available.

Supp. Table S5
Twelve genes enriched in rare variants (rare-EVset) that have no short genetic variants reported to be associated with disease in OMIM, UniProt, ClinVar or the GWAS Catalog (large deletions and insertions >50 kb were not included in the analysis). pLI scores were extracted from the ExAC database. A pLI score ≥0.9 indicates that the gene is extremely intolerant to loss-of-function variation. n.a., not available. *, Function description was adapted from the UniProt database.

Supp. Table S6 In silico predictions for missense variants by SIFT, PolyPhen-2 and MSC-corrected CADD scores.
Variants are reported as "predicted damaging" if above the default SIFT score, if assigned to "probably damaging" or "possibly damaging" by PolyPhen-2, or if the CADD score was equal to or above the gene-specific MSC.

Supp. Table S8 Small biological distance between the 12 genes enriched in rare variants calculated using the human gene connectome (available at http://lab.rockefeller.edu/casanova/GDI).
'Distance', small biological distance; 'Rank', ranking of the target gene compared to all human genes in the query gene-specific connectome; 'BRP', best reciprocal p-value or smallest of the mutual p-values between the query and target gene; 'Median ratio' and 'Average ratio', the median and average distance between the query gene and all human genes; 'Sphere', the sphere of the target gene around the query; 'Degrees of separation', the number of nodes between the query and target genes. For a comprehensive explanation of each term, please refer to Itan et al. (2014).