Dysfunctional DNA repair pathway via defective FANCD2 gene engenders multifarious exomic and transcriptomic effects in Fanconi anemia

Abstract Background Fanconi anemia (FA) affects only one in 130,000 births, but has severe and diverse clinical consequences. It has been theorized that defects in the FA DNA cross‐link repair complex lead to a spectrum of variants that are responsible for those diverse clinical phenotypes. Methods Using NextGen sequencing, we show that a clinically derived FA cell line had accumulated numerous genetic variants, including high‐impact mutations, such as deletion of start codons, introduction of premature stop codons, missense mutations, and INDELs. Results About 65% of SNPs and 55% of INDELs were found to be commonly present in both the FA dysfunctional and retrovirally corrected cell lines, showing their common origin. The number of INDELs, but not SNPs, is decreased in FANCD2‐corrected samples, suggesting that FANCD2 deficiency preferentially promotes the origin of INDELs. These genetic modifications had a considerable effect on the transcriptome, with statistically significant changes in the expression of 270 genes. These genetic and transcriptomic variants significantly impacted pathways and molecular functions, spanning a diverse spectrum of disease phenotypes/symptoms, consistent with the disease diversity seen in FA patients. Conclusion These results underscore the consequences of defects in the DNA cross‐link repair mechanism and indicate that accumulating diverse mutations from individual parent cells may make it difficult to anticipate the longitudinal clinical behavior of emerging disease states in an individual with FA.

At least 22 genes have been identified to be coding for the FA pathway, classified into complementation groups (Che et al., 2018;Cheung & Taniguchi, 2017;Nalepa & Clapp, 2018). All these complementation groups have been identified as biallelic germline mutations that cause FA, with the exception of FANCB and FANCR (Rad51) (Ameziane et al., 2015;Meetei et al., 2004). One of the genes, originally named FANCD1, turned out to be BRCA2 (OMIM #600185) in which mono-allelic mutations cause susceptibility to breast and other cancers, while biallelic mutations lead to FA (Howlett et al., 2002;Mathew, 2006). These genes in FA patients exhibit a wide spectrum of mutations, including deletions, frameshifts, stop codons, splice-site mutations, and missense mutations (Joenje & Patel, 2001). The canonical FA signaling pathway is often divided into three parts (Che et al., 2018). Part I consists of nine known FA proteins (FANCA, B, C, E, F, G, L, M, T, and possibly I) together with FAAPs (FAAP 20/24/100 and MHF1/2) and other known proteins, responsible for the activity of ubiquitin E3 ligase for the monoubiquitination of FANCD2 (OMIM #613984) and FANCI (OMIM #611360). Part II, the FA ID complex, comprises FANCD2 and its paralog FANCI and forms a central axis to connect and orchestrate the entire FA signaling pathway (Che et al., 2018). Part III, the functional units downstream of part II, contains DNA repair proteins that act together following the activation/monoubiquitination of FANCD2/FANCI. Despite the considerable progress in dissecting the genetic and molecular mechanisms of FA signaling, the pathway, especially its parts I and III, likely includes many other genes that have yet to be identified (Che et al., 2018).
In this work, we undertake an effort to understand the genomewide and transcriptome-wide downstream effects of a dysfunctional FANCD2 gene in PD20 cell cultures. We hypothesize that one of the downstream consequences of FA-induced ICL is disrupted transcription, which can lead to altered transcriptional products. It has been shown that transcriptional products can be altered in cancer cells (Silva et al., 2010). A significant level of transcriptome instability has been shown to always be present in cells, and several stabilizing mechanisms are known to be in place to respond to them (Radhakrishnan & Green, ). However, in the absence of a functional FA pathway, we propose that an increased level of unrepaired ICLs can overwhelm the transcriptome stability-maintaining mechanisms and lead to altered transcriptional products, in which case, consequent impairments of cellular function are imminent. While the ICL repair function of the FA pathway is well known, there are studies that suggest that the FA pathway is involved in the general upkeep of genomic stability (Michl et al., 2016). Additionally, microsatellite instability (MIS) has been linked to dysfunctional mismatch repair and recent studies show the contribution of the cross-link repair (FA) pathway to microsatellite instability (Concannon & Lahue, 2014;Hubert, Lin, Dion, & Wilson, 2011;Lin, Hubert, & Wilson, 2009). Hence, we also aim to understand possible effects of a dysfunctional FA pathway on MIS.

| Ethical compliance
This study has been compliant with the ethical conduct of research and existing regulations.
2.2 | Cell lines and DNA/RNA sample preparation PD20 cell lines containing a defective FANCD2 gene and retrovirally corrected FANCD2 were obtained from the FA cell repository at the Oregon Health and Science University (https://www.ohsu.edu/research/fanconi-anemia/celllines. cfm/forum/index.cfm).
The cells were seeded onto six-well plates at 1.5 × 10 4 cells/well. After ~24 hr of incubation, the medium (5 ml of DMEM-10% FBS) was changed. After 4 hr of incubation, the cells were trypsinized and collected for DNA (Qiagen AllPrep Kit) and RNA (Qiagen RNA prep kit) extraction. DNA and RNA from two PD20 samples and two PD20 RV corrected for FANCD2 gene were sent for exomic sequencing and RNA-Seq analysis (Figure 1).

| DNA variant detection
Paired-end sequencing reads of the four Fanconi anemia DNA samples were obtained in the form of fastq files. The samples had a mean of 30 million reads with a standard deviation of 3 million reads. The samples were then checked for quality using the Trimmomatic tool (Bolger, Lohse, & Usadel, 2014) that utilizes a window-based system to check for the sequencing quality of bases. The tool requires user input for three parameters: sliding window length, quality score threshold, and minimum length of sequencing read. The following default values were used: 10, 20, and 70. Only sequences that contained both mate pair reads were used for further processing. After quality trimming, each sample on average had 27 million reads with a standard deviation of 3 million. The "mem" function of the BWA package was used to map the sequencing reads to the GRCh38 reference genome (Li & Durbin, 2009). On average, 99% of reads in all the samples were mapped to the reference genome. The SAMtools package was used to convert the SAM file into a sorted and indexed BAM file ). The AddOrReplaceReadGroups function in the Picard package was used to add read groups to the BAM files. All four BAM files were pooled together to create INDEL realignment targets using the RealignerTargetCreator function in GATK package. The GATK IndelRealigner function was then used on individual samples to create INDEL realigned BAM files (McKenna et al., 2010). The GATK HaplotypeCaller with default parameter settings was used to create VCF files with INDEL and SNP calls, and the SelectVariants function was used to separate SNPs and INDELs. The latter were detected in the size range of 1-10 bp.
Python and shell scripts were written to calculate the number of commonly present SNPs and INDELs in the FA and FA_RV samples. The BEDTOOLS coverage function was used to calculate the percentage of the exome that is covered by the reads and the SAMtools depth function (samtools depth sample.bam|awk '{sum += $3} END {print "Average = ",sum/NR}') was used to calculate the coverage of the reads covering the exome (Quinlan & Hall, 2010).
The VCF files were fed to the SNPEff program to annotate the SNP and INDEL calls (Cingolani et al., 2012). The SNPEff annotation output was used to classify the detected variants according to the type and genomic region. The SNPEff program outputs the possible effects of a given variant and classifies the effect into four categories: high, low, moderate, and modifier.
Custom-written Python scripts were used to perform the pairwise comparison of DNA variants that were predicted as high impact by SNPEff. A SNP is considered to be specific to a sample A, if the altered nucleotide is covered by at least three reads in sample A and not present even once in sample B. This base-by-base comparison was done only for nucleotide positions that were covered by at least 10 reads in F I G U R E 1 Two Fanconi anemia (FA)-PD20 cell lines were included in this experiment. The cell lines represented in yellow are FA cell lines with a dysfunctional FANCD2 gene, while the cell lines represented in pink are FA cell lines that were corrected with a functional FANCD2 gene using retroviral transduction. Variants with respect to the human genome reference found in common among all samples are indicative of variants accumulated prior to extraction from the patient and differences between this individual and the reference. Variants specifically distinguishing the FA and FA_RV sample groups are indicative of the divergence that likely occurred after retroviral correction in the FA_RV sample group. Variant differences within the sample groups (cell culture replicates) indicate the divergence of each of the cell line cultures, as different uncorrected errors accumulate in each sample both samples. The SAMtools mpileup function was utilized to check whether genomic positions that span a variant call in one sample are covered in the other sample. Control biological replicate exomes (SRR2776256 and SRR2776257), to demonstrate the validity of this SNP comparison procedure, were obtained elsewhere (Linderman et al., 2014).

| RNA-Seq data processing and analysis
Single-end sequencing reads from RNA samples were obtained in the form of fastq files. On average, each sample contained 36 million reads with a standard deviation of 1 million. They were quality checked and trimmed using the Trimmomatic tool with the same parameters that were used for the DNA samples. After quality trimming, the samples retained on average 35 million reads (SD 1 million) (Bolger et al., 2014). The quality-trimmed RNA sequence reads were then mapped to the GRCh38 reference genome using Tophat2 with a prebuilt transcriptome index (Trapnell, Pachter, & Salzberg, 2009). On average, 90% (SD: 3%) of the reads were uniquely mapped. The mapped BAM file was passed through the Cufflinks and Cuffmerge programs and was finally passed through the Cuffdiff program as four separate sample groups (Trapnell et al., 2010). The Cuffdiff program generates a pairwise comparison of the FPKM (Fragments Per Kilobase of transcript per Million mapped reads) values for each gene in all four samples, providing six pairwise comparisons. Any gene that was found to have a statistically significant (p value < 0.05) differential expression in comparisons across the FA and FA_RV sample groups, and no significant differential expression within the FA and FA_RV sample groups, was considered to be differentially expressed between the two sample groups.
The Benjamini-Hochberg false discovery rate for the differential expression of the FANCD2 gene was calculated by separately computing the false discovery rate (FDR) for all six possible comparisons of the four samples. These pvalues were extracted from the gene_exp.diff file which is one of the output files that the Cuffdiff program generates. The FDR was computed as follows: Three values were utilized. (a) p value, (b) total number of tests, and (c) rank of that p value when sorted in ascending order. The formula for calculating FDR = p value * (total tests/p value rank). The genes.fpkm_tracking file was used to count the number of exons for each gene using its most expressed transcript.

| Microsatellite Genotyping
Microsatellite (MST) genotyping of the DNA samples was performed using the Repeatseq program (Highnam et al., 2013). The program requires three user inputs: mapped sequencing reads in the form of a BAM file, a list of MST genomic coordinates, and the reference genome. Repeatseq outputs a variant call file listing all possible alleles detected for each microsatellite locus. Custom-written Perl scripts were written to generate a list of MSTs that include primary, secondary, and minor allele information. We followed our established genotyping method to classify a microsatellite into homozygous and heterozygous genotypes (McIver, Fonville, Karunasena, & Garner, 2014;McIver, McCormick, Martin, Fondon, & Garner, 2013). A MST is considered to have a minor allele only if the minor allele was covered by at least three reads.

| Generating the query list of microsatellites
A list of microsatellite loci in version 19 of the human reference genome was generated with a custom Perl script "searchTan-demRepeats.pl" using default parameters. This script has been used in previous microsatellite studies and is available online at https://genotan.sourceforge.net/#_Toc324410847. The genomic coordinates for these microsatellite sequences were converted to their corresponding GRCh38 coordinates. The initial list generated with this script included 1,671,121 microsatellite loci. To mitigate the likelihood of improper read mapping between microsatellites, we removed all subsets of microsatellites possessing the same motif between identical 3′ and 5′ flanking regions. For example, the microsatellites "GCTGC(A) 34 CTTAG" and "GCTGC(A) 15 CTTAG" were preemptively removed from our initial list of microsatellites. The fact that there were many of these potentially ambiguous regions is not surprising considering microsatellites are often embedded in larger repetitive motifs, such as LINEs and SINEs. Our final filtered list included 611,515 microsatellite loci. Microsatellites that were covered by at least 10 reads were genotyped and classified into homozygous and heterozygous.

| Gene ontology (GO) analysis
The DAVID online bioinformatics tool was used to perform the ontological gene enrichment analysis (Huang et al., 2007). The DAVID tool was applied to the 270 genes found to be differentially expressed between the FA and FA_RV sample groups. The Reactome pathway online tool was used to find the pathways in which the two sets of genomic variant-associated genes are involved (Croft et al., 2014). The pathways that were found to have significant (FDR < 0.05) involvement were considered for analysis.

| Genomic variant analysis
The four DNA samples, two biological replicates for FA cell line and two biological replicates for the FA_RV cell line, were sequenced and mapped to the human genome version | 1203 VELMURUGAN Et AL. 38 reference to identify genomic variants, such as SNPs and INDELs, and assess their potential impact. Approximately 55% of the INDEL-containing loci were found (callable) in all four samples while 65% of the SNP-containing loci were commonly callable in all four samples (Table 1), thus implying their common origin, that is, that these variants existed in the clinical sample prior to creating the cell lines. About 70% of INDELs and 75% of SNPs were shared by samples within groups (Table 1). The remaining fraction of unshared variants indicate that the biological replicates continue to genetically diverge, that is, develop different variants that go uncorrected, providing a mechanism for enhanced heterogeneity.
Genomic variants in both sample groups were annotated according to their various effects on genes and gene expression (Supporting information Table S1). Variants, such as start-lost, stop-gained, frameshift variant, which can immediately affect gene expression, were detected in both sample groups.
A significant increase in the fraction of INDELs in all four samples was observed, in comparison with the 1,000 Genome Project samples (1 kGP; Table 2) used as non-FA controls. Chromosomal aberrations such as DNA lesions are signature variants of Fanconi anemia (Meyer et al., 2012). The increased fraction of INDEL variants observed in all four FA samples was consistent with dysfunctional correction mechanisms in FA. The fraction of the exome having been sequenced and the sequencing depth were approximately equal in the four FA samples and the 1 kGP samples. The SNP: INDEL ratios in the three different 1 kGP samples were consistently 9.2:0.8, while all FA samples had a higher fraction of INDELs (8.8:1.2).
A set of 82 genes was associated with high-impact genomic variants found only in the FA samples and a set of 618 genes was associated with high-impact genomic variants found commonly in all four samples (FA and FA_RV sample groups), including 17 in 16 DNA repair-related genes (Supporting information Table S2). To explore in detail the heterogeneity in these four samples, pairwise comparisons of SNP occurrences were made (Table 3). SNP-containing loci were compared only if the corresponding genomic location in the corresponding replicate sample was sequenced at sufficient depth for the genotype to be called. Pairwise  Table S3). A similar pairwise comparison for high-impact SNPs was performed on publicly available human biological replicate control exome samples and the two samples contained, on average, two specific mutations, which elucidates the validity of this comparison procedure. These accumulated mutations in FA_RV2 show that most of the SNP variation have accumulated prior to retroviral correction of the FANCD2 gene. It should be noted that such high levels of variance in the genome are consistent with the fact that FA patients are highly predisposed to cancer (Nalepa & Clapp, 2018). Uncorrected interstrand cross-links can directly affect DNA replication by phenomenally increasing DNA errors which can lead to cell death or uncontrolled cell growth (Osawa, Davies, & Hartley, 2011). Unlike SNPs, INDELs were more prevalent in FA than FA_RV samples, suggesting that INDELs may have continued to accumulate in the FANCD2-uncorrected cell lines. FANCD2 deficiency is related to replication fork restart defects (Thompson et al., 2017), and stalled replication forks have been known to result in INDELs (Sankar, Wastuwidyaningtyas, Dong, Lewis, & Wang, 2016). However, it should be noted that this analysis does not rule out effects of other genome repair checkpoints or even side effects of cell immortalization. A total of 17 high-impact mutations in 16 genes from a variety of DNA repair pathways were indeed found in all four samples (Supporting information Table S2).
Microsatellite genotyping is used to understand the effect of a defective cross-link repair mechanism on MST instability. It has been suggested that MST instability is not only caused by the mismatch repair mechanism defects, but could also be caused by defects in nucleotide excision repair and cross-link repair mechanisms (Concannon & Lahue, 2014;Hubert et al., 2011;Lin et al., 2009). The fraction (5%, i.e., 256 MSTs) of heterozygous MSTs in the FA samples was found to be higher than the fraction (3.5%, i.e., 138 MSTs) in the 1 kGP samples, which is consistent with the increase in the fraction of MSTs with minor alleles (Table 4). The fraction of MSTs with minor alleles was 5% (200 MSTs) in the 1 kGP samples, while the fraction of MSTs with minor alleles was 8.5% (437 MSTs) in the FA samples (Table 4).

| RNA sequencing to measure expression changes
Having examined the extent of genomic damage caused by a dysfunctional DNA repair gene, it was pertinent to examine the downstream effect of this DNA damage on genomewide gene expression patterns. A total of 270 genes were found to be differentially expressed in both biological replicates between the FA and the FA_RV sample groups (Table 5 and  Supporting information Table S4). Significant fold changes in expression ranged from 1.1 to 11.5. Out of the 22 FA-related genes, only FANCD2 did have a significant gene expression change between FA and FA_RV sample groups (Table  6 and Supporting information Table S5). The PD20 cells are characterized by an amino acid change (S126G), mis-splicing, and insertion of 13 bp from intron 5 into the FANCD2 mRNA (Timmers et al., 2001). However, the relative difference in FANCD2 expression between FA and FA_RV sample groups likely results from FANCD2 overexpression in the corrected cells. Although thousands of genes had significant high-impact variants in their coding and regulatory regions, only 270 genes were associated with significant expression shifts, indicating that although expression for a given gene was not changed the product structure was.
Structural and sequence variants have been known to influence the genes' expression levels . Of the 82 genes that contain FA-specific high-impact DNA variants, one was also found to be a differentially expressed gene (GUCY1B3; OMIM #139397) (Supporting information Table S4). The C (Cytosine) at chromosome-4: genomic position-155789752 was found to be mutated into an A (Adenine), converting a TAC codon into a TAA (stop codon). This scenario, a single commonality between highimpact DNA variants and genes with a significant gene expression difference, presents a critical opportunity to study the effect of a dysfunctional DNA repair-caused mutation at the transcriptome level. The effect of the premature stop codon is a gene expression difference between the FA and FA_RV sample groups with a fold change of 5.07 (Supporting information Table S4). GUCY1B3 was found to be upregulated in the FA_RV sample group, which is consistent with the expectation that the sample group (FA) with the high-impact DNA variant will give rise to deterred gene expression. This confirms that although there was significant divergence between the genomes of the biological replicates, the effect at the transcriptome level was negligible, again suggesting that substantial changes to regulatory regions were rare or absent, perhaps because they were under significantly higher selection pressure. Also, as a control, and to put these numbers into context, three 1 kGP RNA-Seq samples were downloaded and pairwise analyzed for differentially expressed genes. The number of differentially expressed genes in these three samples ranged from 105 to 236, significantly less than the number of genes whose expression changed in FA samples. It should be noted that the expression measurements in the 1 kGP samples were for different individuals under divergent conditions, while the FA samples were from the same individual under controlled culture conditions.
In addition to the raw expression level changes, there are indications that these genes are alternatively spliced. Pairwise comparisons of exon count in genes within and across the FA and FA_RV sample groups showed that on average 28.5% of expressed genes had varying exon counts across sample groups, while 23.5% of expressed genes had varying exon counts within sample groups. An increase of 5% of genes (795 genes) with varying exon counts between FA and FA_RV sample groups showed the effects of the dysfunctional FANCD2 gene on alternative splicing (Supporting information Table  S6). These findings suggest that a dysfunctional DNA repair mechanism leads to DNA damage which in turn affects gene expression through truncated RNA transcripts rather than directly affecting the number of genes that are expressed.
T A B L E 5 Of the 270 differentially expressed genes, those with the highest fold change are illustrated here. Eight had a high gene expression fold change and were significantly divergent from the exponential distribution of the full set of genes, when comparing Fanconi anemia (FA) and FA_RV sample groups A "+" indicates higher gene expression in FA_RV sample group, and a "−" indicates higher gene expression in FA sample group. See Supporting information Table S4 for the full list. See Supporting information Figure S1 for exponential fit graph.

| GO analysis of genes with differential expression and/or genomic variants
To gain insights into the mechanistic role of genes that were affected by a DNA repair gene (FANCD2) dysfunction, two sets of genes were analyzed for their gene ontology term enrichment and their overrepresentation in pathways: (a) set of genes (270) were found to be differentially expressed (RNA-Seq) between the FA and FA_RV sample groups; (b) set of genes (82) that were associated with high-impact DNA variants that were specific to the FA sample group and were not found in the FA_RV sample group (Supporting information  Table S7). Pathway analysis of the FA-specific gene set (82) shows involvement in a variety of functions, including apoptosis and transcriptional regulation, along with known immune-related FA functions such as antigen presentation and immune-related signaling (Supporting information Table S8). The common gene variant set (270), on the other hand, is predominantly involved in immune-related pathway functions. This suggests that earlier mutations target the immune system while further mutations that are caused by the unrepaired FA pathway may lead to a wider variety of genomic effects initiating apoptosis.
Pathway analysis of the differentially expressed genes showed that signaling genes were specifically involved in immunological processes, including interferon cell signaling. One of the highly overrepresented pathways as indicated by the differentially expressed genes was the endosomal pathway (Supporting information Table S9). The cathepsin L gene (CTSL -OMIM #116880; FDR: 5.88E-15) that is mainly involved in the lysosomal degradation of proteins was not only differentially expressed between the FA and FA_RV sample group, but it was upregulated in the FA samples and downregulated in the FA_RV samples. The upregulation of the CTSL gene that is directly involved in lysosomal activity is consistent with the increase in the production of broken transcripts (Supporting information Table S4), which can eventually lead to stalled elongation of protein molecules and ribophagy through lysosomes (Doma & Parker, 2006;Lafontaine, 2010).

| CONCLUSIONS
In this study, we quantified the variants from but one of the many (currently estimated at 22) genes that are members of the FA DNA repair complex and found many high-impact variants that have accumulated as a consequence of uncorrected genomic mistakes. Although there were many sequence variants, relatively few genes with altered expression were found. Together, these sequence and expression changes occurred in genes which are typically associated with a spectrum of clinical disorders. The primary clinical disorders suffered by FA patients are consistent with the ones that ontological and pathway analyses would predict for the genes harboring the observed variants. For example, cell viability and the regulation of the extracellular matrix have been convincingly linked to cancer progression (Takamoto, Leppert, & Yu, 1998). The upregulation of a collection of collagen coding proteins we observed is consistent with upregulation of cell death signaling and phagosomal activity. Indeed, FA patients are predisposed to cancers, especially acute myeloid leukemia and squamous cell carcinomas (Mathew, 2006;Takamoto et al., 1998). These rapid genetic changes suggest that the clinical fate of any individual FA patient is likely determined by the unique spectrum of uncorrected stochastic variations accumulated during the lifetime of the patient. Overall, our results provide new insights into genome-and transcriptome-wide effects of a dysfunctional DNA repair gene. Better understanding of the effects of this and other FA genes underlying downstream molecular events that ultimately lead to the disease-causing phenotypical changes will have critical implications for future therapeutic strategies.

DATA ACCESSIBILITY
Raw and processed data have been deposited at NCBI's SRA (#SRP162355).