Gain‐of‐function analysis of cis‐acting diversification elements in DT40 cells

Activation‐induced cytidine deaminase (AID) is required for the immunoglobulin diversification processes of somatic hypermutation, gene conversion and class‐switch recombination. The targeting of AID's deamination activity is thought to depend on a combination of cis‐ and trans‐acting elements, but has not been fully elucidated. Deletion analysis of putative proximal cis‐regulatory motifs, while helpful, fails to distinguish additive from cumulative effects or to reveal redundancy, and may create new motifs where none previously existed. In contrast, gain‐of‐function analysis can be more insightful, shares fewer of these drawbacks, and yields a positive result as its output. Here, we show five defined DNA regions of the avian Igλ locus that are sufficient to confer hypermutation events on a target gene. In our analysis, the essential cis‐targeting elements fully reconstituted diversification of a transgene under a heterologous promoter in the avian B‐cell line DT40. Furthermore, to the best of our knowledge, two of the five regions we report here have not previously been described as individually influencing somatic hypermutation.


INTRODUCTION
In the jawed vertebrate's adaptive immune response, the first stage of immunoglobulin (Ig) combinatorial repertoire development is based on antigen-independent V(D)J recombination. In the second stage of B-cell maturation, stimulation induces gene conversion (GC) and/or somatic hypermutation (SHM) of the Ig gene, and class-switch recombination (CSR). 1,2 All three processes, GC/SHM/CSR, strongly depend on the function of activation-induced cytidine deaminase (AID). 3,4 AID initiates the process by the deamination of deoxycytidine, thereby generating uracil residues in the DNA. 5 The resulting U:G mismatches in the context of immunoglobulin SHM have not been shown to be significantly resolved via mismatch-mediated repair. 6 Rather, the uracils are further processed by uracil-DNA glycosylase, giving rise to abasic sites. 7 An abasic site can be repaired either in a relatively error-free process such as base excision repair or re-routed into an error-prone process such as trans-lesion synthesis. 8,9 In the case of multiple AID-induced lesions in close proximity, even error-free repair can lead to repair-induced staggered double-strand breaks, thus complicating the process and leading to repair path re-routing. 10 As deamination activity has intrinsic mutagenic potential and erroneous targeting of AID can lead to B-cell lymphoma, AID's activity needs to be tightly regulated. 11,12 Although several groups have shown that AID is not completely restricted to the Ig locus, its mutational activity at other loci is lower by several orders of magnitude. 13-16 The mechanism of AID's locus-specific enhanced targeting is not fully understood, but is currently thought to include the interplay of trans-acting factors and cis-targeting elements. Some of the trans-acting factors that have been shown to exert an effect are E2A, NF-κB, Nemo, PU.1, Pax5, PARP1 and Positive Cofactor 4 (PC4). 10,17-21 However, for some identified trans-acting factors it is not clear whether the observed effect is direct or due to signaling and/or transcriptional activities.
In support of cis-acting elements, several studies have shown that various proximal DNA segments influence SHM, GC or CSR outcomes to differing degrees. 22-26 Many identified sequences, e.g. E-boxes, enhancers and silencers, point further to transcriptional influences. Deletion analysis, while helpful to identify new regions of interest, may be thwarted by redundant elements or spatio-temporal effects. 27 In a gain-of-function analysis, the DNA regions analyzed can be better controlled for confounding factors and more flexibly tested, and activity yields a positive result.
In this report, we show in a gain-of-function analysis the activity of five putative motif-containing cis-targeting regions of DNA in DT40 chicken B cells that were identified in a supervised search. Using transfection constructs with constitutive expression of a fluorescent transgene target to separate out transcriptional influences, we tested the selected segments of proximal DNA individually, directionally and in various combinations. Our results suggest that while each element led to differing measured levels of activity individually and in simple combinations, it is the cumulative "super-motif" architecture that defines the process.

Supervised search of putative Cis-targeting motifs
As an AID targeting motif database does not exist, we hypothesized that any such motifs would most likely be located within regions of conserved genome sequence homology, hypersensitive site regions and/or co-located with clusters of suspected protein/transcription-binding motifs. In addition, the pattern of motif clustering itself could be the determining cis-targeting "super-motif" driving function. Therefore, we performed a supervised motif search of DT40's Igλ locus sequence using various web-based tools, e.g. Cis-element Cluster Finder (Cister), 28 hidden Markov modeling (HMMER) and Improbizer, 29,30 Multiple Em for Motif Elicitation (MEME), 31 Motif Alignment & Search Tool (MAST), 32 and BLAT (WUGSC 2.1/galGal3 build). 33 When the tools used required multiple genomic sequence input for comparison, we included available immunoglobulin loci data from Cairina moschata (Muscovy duck), Meleagris gallopavo (turkey), Mus musculus and Homo sapiens. However, sequence homology alone among the closely related members of the class Aves was not used as a deciding factor. We merged the results and found five regions most likely to fit our hypothesis (Figure 1a). 34 The L-region is taken from the intron (leader sequence) of Vλ and is CpG island rich. The R-region begins at the end of the constant domain, ends prior to a putative silencer and encompasses non-coding sequences conserved with those of Homo sapiens, Danio rerio and Xenopus laevis. The S-region encompasses part of the putative 3'-Enhancer and an expanded E-box (CANNTG) type Basic Helix Loop Helix motif situated adjacent to an NF-κB (GGRRNNYYCC) site (CGCAGCTGTGCGGCCGGGGCATCCC). The U-region is part of a 3' hypersensitive region and contains duplicates of two putative motifs found multiple times in the R-region. The N-region encompasses part of a second 3' hypersensitive region, contains a putative zinc finger motif that is also found on the reverse strand in the R-region, and contains non-coding sequences conserved with those of Didelphis marsupialis. The top three putative motif associations found in the supervised search were paired-homeo (e.g. Pax4), paired (e.g. Pax5, Pax6) and C2H2-type zinc finger (e.g. NEMO, ZAS3).
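As an illustration of how the two consensus motifs named for the S-region can be located in a sequence, the following minimal sketch converts an IUPAC consensus into a regular expression and scans the S-region fragment quoted above. This is not the search pipeline used in the study (which relied on the web-based tools listed); the function names `consensus_to_regex` and `scan` are hypothetical, and only the given strand is scanned.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus (e.g. CANNTG) into a compiled regex.

    A lookahead is used so that overlapping sites are also reported.
    """
    body = "".join(IUPAC[base] for base in consensus)
    return re.compile("(?=(" + body + "))")

def scan(sequence, consensus):
    """Return (0-based start, matched site) for each hit on the given strand."""
    pattern = consensus_to_regex(consensus)
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]

# S-region fragment quoted in the text.
s_fragment = "CGCAGCTGTGCGGCCGGGGCATCCC"

ebox_hits = scan(s_fragment, "CANNTG")      # E-box consensus
nfkb_hits = scan(s_fragment, "GGRRNNYYCC")  # NF-kappaB consensus

print(ebox_hits)  # [(2, 'CAGCTG')]
print(nfkb_hits)  # [(15, 'GGGGCATCCC')]
```

Scanning the reverse complement as well would be needed to find sites on the opposite strand, such as the zinc finger motif described for the R-region.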

Testing of the L, R, S, U and N regions for gain-of-function
In testing for gain-of-function, we included the cell lines Aidr1 Igλ-CisKoGFP2 as negative control and, for positive control, Aidr1 Igλ-CisKoGFP2-W, which contains the "W" fragment (the complete region encompassing the V(D)J rearranged ~9.8 kb from the Igλ transcription start to carbonic anhydrase 15-like, Figure 1a) as previously described. 25 As the CpG island rich L-region is situated in the beginning of the Vλ segment, we wanted to test it in a more native position. To achieve this, we cloned the L-region into the test plasmids between the RSV promoter and the eGFP transgene reporter, while the R, S, U and N test elements were cloned into the constructs upstream of the RSV promoter for targeting of the Igλ-promoter or the entire Igλ locus (Figure 1b, c).
We observed in our fluctuation analysis, 2 weeks (~25 generations) post subcloning, that each region individually induces some level of activity, with the N-region and a triplicate of the S-region cloned in the reverse direction demonstrating the strongest effect (Figure 2a). While we see some directional variation in regard to the S-region, the more prominent effect comes from element multiplicity. In Figure 2b, we start to observe the combinatorial effects, with the RU and RSSN regions seemingly having the strongest effect. In Figure 2c, we see the cumulative importance of all the elements together, whereby RSSUN appears to induce higher activity than SSUN, RSSU (Wilcoxon rank-sum test: P = 0.00012) and even the complete "W" region (Wilcoxon rank-sum test: P = 0.048). To rule out transcription levels playing a role in these results, we tested two primary clones from 18 selected element test clones generated and biological replicates of both the CisKoGFP2 (Cl9) negative control and CisKoGFP2-W (Cl3) positive control using RT-qPCR. As can be seen by the ΔCT in Figure 3, transcription levels were stable in comparison to the BAFF reference gene regardless of the element tested.
For further testing of the combined regions, we employed our more flexible and modular pForGene (pFG) plasmid system. The pFG plasmid's design, with multiple multi-cloning sites, allows for the cloning of test elements both upstream and downstream of the RSV promoter, and also allows for the easy exchange of selectable marker cassettes. Initially, we wanted to test the L-region's activity in Aidr1 wv cells using pFG2, which targets the Igλ-promoter (Figure 1b) while leaving the "W" region in its native position. As can be seen in Figure 4, the L-region has a significant enhancing effect (Wilcoxon rank-sum test: Aidr1 wv-pFG2GFP-L versus Aidr1 wv-pFG2GFP, P ≤ 0.0001). The effect is similar in the ΔPC4 background, suggesting a role that is not associated with damage repair as we have previously shown for Positive Cofactor 4. 10 This argues for the L-region's role being most probably associated with the targeting or enhancement of AID activity. This was further confirmed by sequencing 2-week bulk cultures of Aidr1 wv-pFG2GFP, Aidr1 wv-pFG2GFP-L and Aidr1 wv-ΔPC4 pFG2GFP-L (Figure 5a-c). While the L-region alone increased the mutation rate slightly, the mutation spectrum (Figure 5d) results are in line with our previously published results for Aidr1 wv-pFG2GFP (5 × 10⁻⁵ mut/bp/gen, trs:trv = 1:0.95) and Aidr1 wv-ΔPC4 pFG2GFP (4.7 × 10⁻⁵ mut/bp/gen, trs:trv = 1:3.4), where mut/bp/gen denotes mutations/base pair/generation and trs:trv denotes transitions:transversions. 10 Interestingly, both the L-region alone and the L-region in the ΔPC4 genomic background decreased the number of mutations falling outside the canonical WGCW hotspots. 35
In Figure 6, we see the results of L and RSSUN in the pFG3 construct, which targets the entire Igλ locus for knock-out (Figure 1c). We tested RSSUN in the forward (+) and reverse (-) directions, with and without the L-region, and in some instances with and without AID. As can be readily observed, L-RSSUN(-) and L-RSSUN(+), as well as RSSUN(+), achieved full reconstitution of activity in comparison to the Aidr1 Igλ-CisKoGFP2-W positive control. Interestingly, RSSUN(-) without the L-region performed less well. However, as expected, the absence of AID reduced the GFP fluorescence intensity reduction rate to effectively 0%.
Figure 2. (a) Individual regions tested singly, multiply and in some cases directionally (signified by backwards lettering); (b) regions tested in simple combinations, including directionally; (c) regions further tested in multiple combinations. Each element was tested using 2-4 primary clones, and the CisKoGFP2 negative control was tested as a biological triplicate. The reconstituted positive control, CisKoGFP2-W, was a single primary clone included for reference. Primary clones and biological replicates were subcloned by limiting dilution and up to 36 subclones for each were measured by flow cytometry for GFP fluorescence intensity reduction/loss after 14 days (25 generations). Each data point represents the result of a single subclone and the bar indicates the mean. The dotted reference lines signify the mean values of the Aidr1 Igλ-CisKoGFP2-W positive control and the Aidr1 Igλ-CisKoGFP2 negative control. Genome representation not drawn to scale. Wilcoxon rank-sum test: RSSUN versus 'W' P = 0.048; RSSUN versus SSUN and RSSU P = 0.00012.

DISCUSSION
In the age of omics-based Big Data, even small experimentally validated functional datasets are crucial to informing training datasets in order to improve predictive modeling. 36 While there exists a substantial body of resources for transcription and transcription-related factors, little is known concerning more complex regulatory systems encoded in the DNA that are not as amenable to current bulk data producing protocols. With limited a priori knowledge of the cis-targeting motifs that direct and enable AID activity, we approached the supervised motif region search systematically using well-defined parameters regarding conserved sequence homology, motif co-localization and biochemically elucidated hypersensitive sites. We also held the hypothesis that multiple motif-containing regions would most probably be involved, since the end effect is the deamination of genomic DNA, an action that needs to be tightly controlled. With our supervised motif search and gain-of-function approach, we identified five distinct cis-regulatory regions that have a role in targeting AID activity that appears in our system, via the use of a constitutive promoter, to be uncoupled from the influence of transcription/expression regulation.
Although we have identified five distinct functional regions, our experimental system is limited in that it is not geared for high-throughput studies. It is simply not practically feasible to test each region base by base in living cells to further define the exact DNA motifs and associated contextual nucleotide flanking sequences within the regions identified. For further motif analysis, it would be preferable to use our and others' published results to aid computer-assisted Big Data modeling to reduce the number of clones needed for further validation studies. We partly did this in regard to the multiplicity of the S-region, as proximal regions upstream and downstream contained additional desired canonical and non-canonical adjacently located E-box and NF-κB sites. 37 However, those regions also contained putative motifs that were selected against in our supervised motif search, e.g. silencers. As the Igλ locus is highly repetitive and the R, U and N regions already contain multiple repeats of selected putative motifs, we chose to test the S-region multiply. By doing so, we avoided the possible inclusion of activity-reducing regulatory elements while achieving a similar "super-motif" architecture.
Our results will complement and inform many of the newer Big Data strategies employed to uncover AID on- and off-target regulation such as epigenetic AID targeting. 38 AID activity is a critical biological function that can also be highly detrimental if not properly regulated. AID's off-target aspect is of great importance as, besides lymphomagenesis, AID has recently been implicated in active demethylation, kataegis and non-lymphoid cell carcinogenesis. 39-41

METHODS

Cell culture, cell transfection and growth curve
DT40 cells were cultured in DMEM/F-12 supplemented with 10% fetal bovine serum, 1% chicken serum, 2 mmol L⁻¹ L-glutamine, 0.1 mmol L⁻¹ β-mercaptoethanol and penicillin/streptomycin (chicken media, CM) at 41°C in a 5% CO2 environment. Transfections were carried out using the Gene Pulser Xcell System with CE Module (Bio-Rad, Hercules, CA, USA). Briefly, 10⁷ logarithmically growing cells of ≥80% viability were harvested for each transfection. The cells were resuspended in 800 µL ice-cold PBS with 40 µg of linearized plasmid and placed into an electroporation cuvette (0.4 cm gap). For transfecting DT40 cells, the exponential protocol setting with 700 V and 25 µF was used. Upon electroporation, the cells were transferred immediately into 10 mL CM and then aliquoted at 100 µL per well across a 96-well plate and incubated at 41°C. After 24 h (DT40's doubling time is ~13 h 10), 100 µL of appropriate drug selection media was added per well (final concentration: Blasticidin S 15 µg mL⁻¹, Puromycin 0.5 µg mL⁻¹) and incubated for a further 1-2 weeks for colony picking and testing. Targeting in DT40 was confirmed via PCR using primer pairs PS31-PU5 for Ig-Pro-KO (pFG2 constructs) and PS169-PU5 for Igλ-KO (pFG3 and pCisKoGFP2 constructs).

PCR, cloning and sequencing
For the initial cloning and testing of selected cis-regions, we used the pCisKoGFP2 construct (targeting the entire Igλ locus) as previously described. 25 The GFP2 cassette has a modified cloning site, consisting of NheI and SpeI, upstream of the RSV-GFP-ires-BSR-SV40pA. The R, S, U and N fragments have the following PCR-introduced, modularly designed restriction sites for bi-directional, individual, multiple and sequential cloning (primers used in parentheses): R, AvrII/Xba-SpeI/NheI/AvrII (rbc557, rbc558); S, SpeI/NdeI-NdeI/NheI (rbc559, rbc561); U, SpeI/KasI-KasI/NheI (rbc563, rbc564); N, SpeI/NsiI-NsiI/NheI/BamHI (rbc565, rbc566). The L-region was assembled for cloning using the overlapping oligos rbc261, rbc262, rbc263 and rbc264, which incorporated an NheI site on the 5' end and a SpeI site on the 3' end. For further cloning of the L-region, rbc600 and rbc601 were used to introduce a SpeI site 5' and an NheI site 3'. For confirmation testing in DT40 cells, we used our pForGene (pFG) system as previously described. 10 Briefly, the pFG's cassette design is BamHI-SpeI-RSV-NheI/EcoRI/BglII-IRES-NcoI-BSR/Puro-NcoI-SV40pA-BamHI in a multi-cloning site-modified pBluescript KS+ (Agilent Technologies, Santa Clara, CA, USA) backbone. The base construct of pFG2 has homologous arms targeting the knock-out of DT40's rearranged Igλ promoter (5' arm, chr15: 7933638-7935749; 3' arm (minus intervening sequence), chr15: 7930437-7961015/7932855-7933303), while the construct of pFG3 has homologous arms targeting the knock-out of DT40's entire Igλ locus (5' arm, chr15: 7955324-7957914; 3' arm, chr15: 7920634-7921694) as previously described. 10,25 For the cloning of the transgene fluorescent marker, the primer pair rbc598-rbc602 was used for eGFP. All cloning was carried out using restriction enzymes and competent cells (New England Biolabs, Ipswich, MA, USA), and the Takara DNA Ligation Kit (Clontech, Mountain View, CA, USA) per the manufacturers' instructions.
Figure 6. Confirmation and directional testing of 'RSSUN' activity with and without the L-region in DT40 Ig-p (cis knockout) cells. Fluctuation analysis of GFP fluorescence intensity reduction/loss. Each element configuration tested had 2-4 primary clones, and a single primary clone for each of the Aid-/- constructs was tested. Cells were subcloned by limiting dilution and up to 36 subclones for each were measured after 14 days (25 generations) by flow cytometry for fluorescence intensity reduction/loss. Each data point represents the result of a single subclone and the bar indicates the mean. Genome representation not drawn to scale.
For sequencing of the GFP transgene, genomic DNA (gDNA) was isolated using the DNeasy Blood & Tissue Kit (Qiagen, Venlo, the Netherlands) according to the manufacturer's instructions. The gDNA served as template for the primer pair rbc583-rbc587 and the PCR products were cloned for sequencing as previously described. 10 Bi-directional sequencing was done using the Big Dye Terminator v3.1 Cycle Sequencing Kit (Thermo Fisher Scientific) with primers rbc776 and rbc585 (LGFP), and rbc734 and rbc585 (GFP) as previously described. 10,42 Primers used are listed in Supplementary table 1, and Supplementary table 2 gives an overview of clones used and experiments performed.
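The quantities reported from transgene sequencing (mutations/base pair/generation, the transition:transversion ratio and whether a mutation falls in a WGCW hotspot) can be sketched as follows. This is a minimal illustration and not the analysis code used in the study: the function names are hypothetical, the rate is assumed to be the mutation count divided by the total bases sequenced and the number of generations, and all input numbers are invented for the example.

```python
# Purine<->purine and pyrimidine<->pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def mutation_rate(n_mutations, n_sequences, length_bp, generations):
    """Mutations per base pair per generation (assumed definition)."""
    return n_mutations / (n_sequences * length_bp * generations)

def transition_transversion(mutations):
    """Count transitions and transversions in a list of (ref, alt) pairs."""
    trs = sum(1 for m in mutations if m in TRANSITIONS)
    return trs, len(mutations) - trs

def in_wgcw_hotspot(seq, pos):
    """True if seq[pos] is the G or the C of a WGCW motif (W = A or T)."""
    # The motif starts at pos-1 if pos is its G, or at pos-2 if pos is its C.
    for start in (pos - 1, pos - 2):
        if start < 0 or start + 4 > len(seq):
            continue
        w1, g, c, w2 = seq[start:start + 4]
        if w1 in "AT" and g == "G" and c == "C" and w2 in "AT":
            return True
    return False

# Hypothetical example values, not data from the study.
rate = mutation_rate(n_mutations=36, n_sequences=100, length_bp=720, generations=10)
trs, trv = transition_transversion(
    [("C", "T"), ("G", "A"), ("C", "G"), ("G", "T"), ("C", "A")]
)
example_seq = "TTAGCTAACGT"  # contains one WGCW site (AGCT) starting at index 2

print(rate)                             # mutations/bp/generation
print(trs, trv)                         # 2 transitions, 3 transversions
print(in_wgcw_hotspot(example_seq, 4))  # C of AGCT: True
print(in_wgcw_hotspot(example_seq, 8))  # C of ACGT: False
```

Because WGCW is its own reverse complement, checking the G and C positions on one strand covers hotspot cytosines on both strands.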

RT-qPCR
To assess whether RSV-promoter driven transcription was being influenced by the cis-regions tested, cDNA was made from total mRNA using the RNeasy Mini Kit (Qiagen) and SuperScript III First-Strand Synthesis System (Thermo Fisher Scientific) according to the manufacturer's instructions. Quantitative PCR (qPCR) was performed with the LightCycler 1.5 (Roche) using the FastStart DNA Master SYBR Green I kit (Roche) according to the manufacturer's instructions. Primers used were rbc9-rbc10 for the B-cell activating factor (BAFF) control, and rbc843-rbc844 for GFP.
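The ΔCT comparison of GFP against the BAFF reference can be illustrated with a short sketch. It assumes the conventional definition ΔCT = CT(target) - CT(reference) computed on replicate means, and an amplification efficiency of about 100% for the fold conversion; the CT values shown are hypothetical and the function names are not from the study.

```python
from statistics import mean

def delta_ct(ct_target, ct_reference):
    """Delta-CT: mean CT of the target minus mean CT of the reference gene."""
    return mean(ct_target) - mean(ct_reference)

def relative_expression(dct):
    """Target level relative to the reference, assuming product doubles each cycle."""
    return 2 ** (-dct)

# Hypothetical technical replicates (CT values), not data from the study.
gfp_ct = [22.0, 22.4, 22.2]
baff_ct = [24.1, 24.3, 24.2]

dct = delta_ct(gfp_ct, baff_ct)
print(round(dct, 2))                       # -2.0
print(round(relative_expression(dct), 2))  # 4.0
```

Stable transcription across constructs would show up as similar ΔCT values for every element tested, regardless of the absolute CT of either gene.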

Flow cytometry
For monitoring of cells based on fluorescence status, cells were washed twice and resuspended in sterile PBS. Flow cytometry was performed using the LSR II (BD Biosciences), whereby 5000-20 000 live cells were counted for each subclone using 488 nm for excitation, and measuring GFP at 530 nm. Gating for fluorescence intensity reduction/loss measurement was set at >2-fold below the cloud and subclones demonstrating ≥75% reduction were discounted as likely harboring a mutation prior to subcloning.
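The exclusion rule for subclones can be sketched as a simple filter. The sketch assumes that each subclone is summarized by the percentage of live cells falling in the reduced-fluorescence gate and that "≥75% reduction" refers to this percentage; both are interpretations, and the function name and values are hypothetical.

```python
from statistics import mean

def summarize_subclones(percent_reduced, exclusion_threshold=75.0):
    """Drop subclones at or above the threshold and average the remainder.

    Returns (mean percentage of the kept subclones or None, number excluded).
    """
    kept = [p for p in percent_reduced if p < exclusion_threshold]
    excluded = len(percent_reduced) - len(kept)
    return (mean(kept) if kept else None), excluded

# Hypothetical percentages of GFP-reduced cells per subclone.
subclones = [0.5, 1.2, 0.8, 92.0, 2.5]
mean_reduced, n_excluded = summarize_subclones(subclones)

print(mean_reduced)  # 1.25
print(n_excluded)    # 1
```

A subclone that is almost entirely GFP-reduced most likely descends from a cell that was already mutated at the time of subcloning, so it reflects one early event rather than the rate of new events.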

Statistical analysis
For comparison of the results generated in the fluctuation analyses, the Wilcoxon rank-sum test (also called the Mann-Whitney U-test) was used to compare the subclone data points, and P-values less than 0.05 were considered as significant.
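For readers unfamiliar with the test, the following is a minimal, dependency-free sketch of a two-sided Wilcoxon rank-sum (Mann-Whitney U) test using the normal approximation with tie and continuity corrections. It is not the software used in the study; the function name `rank_sum_test` and the example data are hypothetical, and for small samples an exact test from a statistics package would be preferable.

```python
import math

def rank_sum_test(x, y):
    """Two-sided rank-sum test; returns (U statistic of x, approximate p-value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool the values, remembering which sample each came from (0 = x, 1 = y).
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)

    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t  # tie correction for the variance
        i = j

    r1 = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:
        return u1, 1.0
    z = max(0.0, (abs(u1 - mu) - 0.5) / sigma)  # continuity correction
    p = math.erfc(z / math.sqrt(2))             # two-sided normal tail
    return u1, p

# Hypothetical subclone values for two constructs.
u, p = rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(u, round(p, 4))
```

With the 0.05 threshold stated above, the two hypothetical samples in this example would be called significantly different.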