Genetic modifiers of long‐term survival in sickle cell anemia

Abstract Background Sickle cell anemia (SCA) is a clinically heterogeneous, monogenic disorder. Medical care has less‐than‐optimal impact on clinical outcomes in SCA in Africa due to several factors, including patient accessibility, poor access to resources, and non‐availability of specific effective interventions for SCA. Methods Against this background, we investigated 192 African participants who underwent whole exome sequencing. Participants included 105 SCA patients spanning variable clinical expression: a “long survivor” group (age over 40 years), a “stroke” group (at least one episode of overt stroke), and a “random” group (patients younger than 40 years without overt cerebrovascular disease). Fifty‐eight ethnically matched homozygous hemoglobin A controls were also studied. Findings were validated in an independently recruited sample of 29 SCA patients. Statistical significance of the mutational burden of deleterious and loss‐of‐function variants per gene against a null model was estimated for each group, and gene‐set association tests were conducted to test differences between groups. Results In the “long survivor” group, deleterious/loss‐of‐function variants were enriched in genes including CLCN6 (a voltage‐dependent chloride channel for which rare deleterious variants have been associated with lower blood pressure) and OGHDL (important in arginine metabolism, which is a therapeutic target in SCA). In the “stroke” group, significant genes implicated were associated with increased activity of the blood coagulation cascade and increased complement activation, for example, SERPINC1, which encodes antithrombin. Oxidative stress and glutamate biosynthesis pathways were enriched in “long survivors” group. Published transcriptomic evidence provides functional support for the role of the identified pathways. Conclusions This study provides new gene sets that contribute to variability in clinical expression of SCA. Identified genes and pathways suggest new avenues for other interventions.


Contents
. Workflow of the data analysis.       A, unusual gene-specific frequency difference between SCA with stroke, and the rest of SCA patients. B, gene-specific signal of unusual genetic difference between "long survivor" patients and the rest of SCA patients. C, unusual differentiation between SCA with stroke and "long survivor" SCA patients. Table S3 displays the results of Cameroun control versus each the above SCA group. Figure S9. Significant pathways and Biological processes associated to gene-specific difference in SNPs frequency between "long survivor" and other SCD patients. A. Significant pathways associated to genes exhibited a large unusual gene-specific difference between "long survivor" group and other SCD patients. B. diagram of the top significant pathway in panel A. C. Significant biological processes associated to genes with unusual departure of gene-specific exome wide difference between "long survivor" and other SCD patients. 2

Introduction
The African continent harbours the greatest genetic and environmental diversity 1, 2 and has the highest health burden per capita 3 , yet there is a scarcity of large-scale disease-specific genome studies of African populations 4,5 . Sickle cell disease (SCD) has its highest burden in Africa, particularly in Central and West Africa. Here, we report the first study on deleterious variants in deep whole exome sequencing of 192 individuals. This is the first data on whole exome sequencing landscape of SCD in Africa, which has generated a novel database of candidate modifier genes. The exome data provided will serve the global community as it represents the first step of a series of studies on the genomic architecture of SCD, which will be enhanced by the establishment of major NIH-funded research consortia for the study of SCD in Africa 6 . Figure S1 describes the overall pipeline used for data generation and analyses.

Ethics Statement
The study was performed in accordance with the guidelines of the Helsinki Declaration. Ethical approval was given by the National Ethical Committee Ministry of Public Health, Republic of Cameroon (No 033/CNE/DNM/07) and the University of Cape Town, Faculty of Health Sciences Human Research Ethics Committee (HREC RE: 132/2010). Written and signed informed consent was obtained from the adult participants who were 18 years or older, and for the children consent was obtained from parents/guardians with an assent from the participants older than seven years of age.

Patients and assessment of clinical events
The recruitment for the discovery group was conducted in Cameroon at the Yaoundé Central Hospital and Laquintinie Hospital in Douala, as previously described 7 . Briefly, socio-demographic and clinical data were collected by means of a structured questionnaire. Anthropomorphic variables were measured in the outpatient setting. Routine blood counts of patients and haemoglobin (Hb) electrophoresis were conducted on arrival at the hospital. In the present study, three sub-groups of SCD patients were included: 1) the "stroke" group made of SCD patients with at least one clinical episode of overt stroke, a devastating complication of SCD, occurring in 11% of patients before age 20 years and considered to be a proxy of severity 8 , influenced by genetic modifiers 9 ; 2) the "long survivor" SCD group made of patients older than 40 years considered here as the most genetically fit; whose cut off was based on life expectancy of 43 years for SCD-HbSS in the cooperative study conducted four decades ago in the USA8; 3) the "random" group made of SCD patients randomly selected among clinical stable patients without incidence of any cerebrovascular disease by clinical criteria and younger than 40 years.
The replication cohort consisted of adult SCD patients recruited at the Haematology Clinic, Groote Schuur Hospital in Cape Town, that are mostly recent migrants from other sub-Saharan African countries where SCD is prevalent 10 . In order to have a similar genetic background for this replication group, only patients from Demographic Republic of Congo (DRC) were included in the present study. Details of sample information are in Table S2 2

.3 Control participants
A total of 58 ethnically matched Cameroonian controls were randomly recruited from healthy blood donors in Yaoundé 11 and volunteered their participation in the study. Only individuals, without HbS mutation and that were homozygous HbAA as confirmed by molecular analysis, were included.

Molecular methods: Sickle cell anaemia mutation, β-globin gene cluster haplotypes, and 3.7 kb α-globin gene deletion
DNA was extracted from peripheral blood following the manufacturer's instructions (Puregene Blood Kit; Qiagen, Hilden, Germany). Molecular analysis was performed to determine the presence of the sickle mutation and was carried out on 200 ng DNA by polymerase chain reaction (PCR) to amplify a 770 bp segment of the β-globin gene. This was, followed by DdeI restriction analysis of the PCR product 12,13 . Using published primers and methods, five restriction fragment length polymorphism (RFLP) sites in the β-globin gene cluster were amplified to analyse the XmnI (5'Gγ), HindIII (Gγ), HindIII (Aγ), HincII (3ψβ') and HinfI (5'β) for the HBB haplotype background 11 . The 3.7 kb α-globin gene deletion was successfully screened, using the expand-long template PCR (Roche Diagnostics, Basel, Switzerland), as previously published 11,14 . sequencer. WES was generated on the HiSeq X.

Reads Mapping and Alignment
To insure the forward and reverse reads are of high quality and appropriate length, we evaluate these using used both FastQC 15 and SolexaQA++ 16 .
From the UCSC database 17 , we obtained the human reference genome, version hg19 (build37), together with its gene annotation. The reads were aligned to the UCSC hg19 (build37) complete reference genome using BWA 18,19 . Using the Picard tool kit 20 , the duplicate reads were marked, and after alignment, the BAM files were sorted by coordinates, indexed and read groups were added via Picard 20 . BAM files were re-ordered according to UCSC hg19 17 . Insertions and deletions at the end of the reads can misguide the aligners into mis-aligning with mismatches. This artificial mismatch can mislead base quality score recalibration and variant detection. To address this issue, we used the Genome Analysis Toolkit (GATK) software 20 for local realignment along all reads at a problematic locus to create a cleaned version of the BAM file and found a best consensus sequence that, together with the reference, best fits the reads in a pile. We used the 1000 genomes phase 1 INDELs and Mills and 1000 genomes gold standard INDELs 21 to drive the process. Using these known sites improves the accuracy. After INDEL realignment we applied Picard "FixMateInformation" to recalculate read pair information to see if it has changed. At this stage all 192 samples have > 79% of target bases covered to ~35x.

Variant Calling
As different calling methods produce a large number of differing variants and previous studies have demonstrated that these methods have differing advantages 22,23,24 , we adopted an ensemble approach implemented in VariantMetaCaller 25 in each data set of subjects group and the all dataset of 192 subjects. To detect SNPs and short indels, we combined information generated from three independent variant caller pipelines ( Figure S1) Figure S1). The best practice specific to each caller were adopted 28 .

Variant Calling Quality Control and Final Call-set
Before applying the ensemble approach from the resulting variant sets per subject group and all subjects from each these three callers respectively, we filtered each resulting VCF files using the GATK tool "VariantFiltration".

Flagging Variants
We added additional filter levels to each call set as follows: (1) If 3 SNPs are detected within a window of 10 base-pairs, the site will be flagged as a "SNPCluster" in the FILTER column (2) if 4 or more alignments having a mapping quality of MQ = 0 (which means it maps to different locations equally well) and the number of alignments that mapped ambiguously are more than a tenth of all alignments, it is difficult to decipher artefacts and true 37 differences. These sites will be flagged as "HARD_TO_VALIDATE", (3) SNPs which are covered by less than 5 reads may be potential artefacts and these sites was flagged as "LowCoverage", (4) SNPs having a SNP quality below 30 are typically artefacts, were flagged as "VeryLowQual", (5) SNPs having a quality score between 30 and 50 are potential artefacts, flagged as "LowQual", (6) SNPS having a QD score < 1.5 are indicative of false positive calls and artefacts, flagged as "LowQD" (7) and SNPs covered only by sequences on the same strand are often artefacts, was flagged as "StrandBias".

Variants Quality Control Assessment Prior Final Call-set
Variants flagged "VeryLowQual", "LowQual", "LowQD" and "StrandBias" were removed in each VCF files. Exomes VCF files were assessed according to the total reads, coverage distribution, raw error rates, transition/transversion (Ti/Tv) ratios (3.2), comparison of genomic sex to recorded sex, distribution of known variants (relative to dbSNP), CytoSNP array fingerprint concordance > 99%, homozygosity, heterozygosity, and sample contamination validation. Additionally, variant sites that strongly deviate from Hardy-Weinberg equilibrium (p-value < 5x 10-5) were flagged. These criteria reduce the inclusion of false-positive variant calls during the ensemble of the VCF files.
The final call-set from each subjects group, were produced from VariantMetaCaller 25 , a support vector machines approach that combined the hardfiltered VCF files obtained from these three variants callers.  45 . From each resulting functional annotated data set, we independently filtered for predicted functional status (of which each predicted functional status is of "deleterious" (D), "probably damaging" (D), "disease_causing_automatic" (A) or "disease_causing" (D). 46 We used a casting vote approach, by retaining only a variant if it had at least 17 predicted functional status "D" or "A"out of 21. Second, the retained variants from each data set were further filtered for rarity, exonic variants, and nonsynonymous mutations and with high quality call as describe above, yielding a final candidate list of predicted mutant variants in each subject group, including the replication group. To compare the results from the above strategy, we re-applied FATHMM 22 , a disease-specific weighting scheme, which uses a Hidden Markov Models prediction algorithm capable of discriminating between disease-causing mutations and neutral polymorphisms. We report on the aggregated SiPhy score from all identified mutants SNPs within gene.

Reconstruction of Sub-network
To identify sub-networks of interactive mutant variant genes (section 7 above) or from gene-specific difference in SNPs frequencies genes (section 13 below) from each SCD category occurred, we used a comprehensive human Protein-Protein Interaction (PPI) network 50,51,52,53,54,55 to analyse how each set of variant genes are layered and interact within a biological network, thus extracting a sub-network. A clustering script in R's (R Core Team, 2016) 39 igraph package was used to determine a network plot that would allow us to identify the hub proteins in the sub-networks.

Enrichment of genes within Sub-network
We examined these candidate variants genes from either mutation prioritization (section 7 above) or gene-specific difference in SNPs frequencies (see section 13 below) are interact with other genes and how in the formed sub-network they can be associated with human phenotypes and what are their pathways, biological processes and molecular functions. To address this, we used a custom script and adopted to compare our results with different tools including DAVID 56 and PANTHER 57 . We additionally conducted an enrichment analysis using the Enrichr software 58 . The most significant pathway enriched for genes in the networks were selected from KEGG 59 , Panther 57 , Biocarta 60 and Reactome 61 , and gene ontologies from the Gene Ontology Consortium 62 were defined for cellular component, biological process and molecular function.

Further Characterization of Enriched Sub-networks.
To identify the association between each sub-network Sj, (j=1,…,T) within n1,..., nT genes to human pathway, the set of human pathways. We obtained 1,547 annotated pathways 51,52,53,54,55 and collected more than 107 annotated pathways from the KEGG 59 , BioCarta 60 and Ambion GeneAssistTM Pathway Atlas pathway databases. We downloaded genomic co-ordinates for all genes from the NCBI ftp-server 45 and retained only entries for the human reference sequence. We assigned the SNPs located within a gene or in LD within less than 40kb distance up/downstream of the gene using dbsnp database 45 . Let a number of genes in the intersection between genes within Sj and genes within pathway . Let b be the number of genes in the intersection between genes within Sj and those in the union of all pathways . Let n be the number of genes in the intersection between genes in the pathway and those in the union of all pathways with , and m be the total number of genes in all pathways . We computed the statistic of significance of overlap between sub-network Sj, of nt genes and a given pathway using the z-score (ZS), which employs the binomial proportion test, 40 10. Procedure to Aggregating SNPs Summary Statistics at the Gene level.
From each subject group, we estimated the statistical significance at gene level from the list of resulting genes associated to predict mutant variants (section 7 above) or from the list of candidate gene-specific difference in SNP frequencies (section 13 below). In doing so, we aggregated the P-values from SNPs 40kb downstream and upstream gene region (exon) as per gene-based annotation file in our exome datasets. Under the null hypothesis, pvalues pk (k = 1,...,L) for a test-statistic with a continuous distribution are uniformly distributed in the interval [0,1]. It follows that a parametric cumulative distribution function F can be chosen and pk can be transformed into quantiles according to . The combined test statistic is a sum of independent and identically distributed random variables . To account for the independence assumption given correlation among neighbouring genomic markers, we implement the Stouffer-Liptak method accounting for spatial correlations among SNPs within a gene or SNPs within a given sub-network. The overall statistic can be obtained by , in which is the cumulative distribution function of the standard normal distribution. We apply the Benjamini-Hochberg false-discovery correction method to control the type I error rate and account for gene/subnetwork differences in the numbers of associated SNPs. From each subject group, we reported on the overall statistical significance of gene mutational burden gene, p-values after the Benjamini-Hochberg false-discovery correction. 41

Phased and Haplotypes Inference
From the VCF file resulted from joint variant calling (section VI above), a merged dataset of all 192 samples (58 Hb AA Cameroon controls, a replication cohort of 29 SCD patients from DRC, 56 random, 26 Stroke, and 23 survival SCD patients and 58 controls from Cameroon. We further conducted quality control to remove all structured, indel, multi-allelic variants and those with low minor allele frequency (MAF < 0.05) prior to phasing. We first phased and inferred the haplotypes using Eagle 66 from the resulting curated data.
Second, we extracted and re-phased samples from the 1000 Genomes Project 21 . To merge our exomic haplotypes to those from 1000 Genomes, we computed a cross-imputation using impute2 67 . We performed further quality control after the imputation, which consisted of removal of variants with low minor allele frequency (MAF < 0.05); low genotypes call (<= 95%) and imputation accuracy (< 955) prior to re-phasing using Eagle66. We performed a post-phasing quality control in which we checked the switch-error between our exomic haplotypes panel and the exomic haplotypes merged to 1000 Genomes panel 21 , where 99.7% and 97,05% of the sites were with no phase switch-error in both panels, respectively. We further compared sites discordance between these haplotype panels and independently with their original VCF file prior phasing. The only site with phase switch-errors showed discrepancies in MAF and were therefore removed.

Fine-scale Population Structure
Population structure analyses were performed to characterize the genetic contributions to our 192 patient samples. We first tested whether populations conform to an isolation-by-distance model or whether there is strong heterogeneity among populations relative to geographic distance. To this end, we have merged our 192 samples with data from 1000 Genomes Projects in which we performed principal component analysis using smartpca in the EIGENSOFT package 68,69 and included all SNP haplotypes shared between the populations analysed. Population-specific pair-wise genetic distance (F_ST) and a phylogeny tree was computed from smartpca. All the 192 samples, including the SCD from DRC were observed to cluster together, 42 indicating geographic and genetics proximity. Table below displays the population-specific F_ST, indicating close genetic relatedness between SCD groups from Cameroon as well as relatedness between SCD patients from DRC and the Control Cameroon group. The data suggests an isolation-bydistance model.

Unusual Genetic Difference: SNP-specific differences in allele frequency
Similarly, to 70,71 , and assuming the population evolved under the Wright-Fisher model, selective neutrality and with an expected number of mutations, we used a step-wise constant effective population size 70 to compute allele frequency spectrum. We secondly computed the pair-wise group frequency spectrum difference at SNP level, thus SNP allele frequency spectrum difference are assigned to a given gene if they are located within the gene's 40kb downstream or upstream using dbsnp database, thus aggregating SNPs allele frequency difference into genes 51,52,53,54,55 . To this end, let and be the allele frequency spectrum in group 1 and group 2, respectively. To minimize deviation from the normality assumption, SNPs with minor allele frequencies < 0.05 are excluded. Thus, at a given locus i, the difference between observed variant allele frequencies of two groups 1 and 2. Let gene j (j = 1,.., N) has associated SNPs, thus is gene j (j=1,..,N) frequency difference from group 1 and 2, and can be approximated as a normal distribution under neutral drift with mean 0 and variance 71,72 where is the genetic distance between the group 1 and 2 total variant allele counts in each population, and p is the overall gene-specific ancestral frequency that is approximated as the average of the two observed variant allele frequencies from SNPs associated to gene j. Similar to 70,71 , we test unusual gene-specific difference U12 from groups 1 and 2 as follows 44 This equation is χ² distributed with 1 degree of freedom (d.o.f). An excess of large values of the χ² statistic indicates deviations from the null model, suggesting the action of natural selection 71,72 . We applied this method to each group-pair from the three groups of SCD. Finally, SNP-specific unusual allele frequency summary statistics of SNPs within gene region can be aggregated (see section 10 above) to obtain gene-specific differences in SNP frequencies.