Evaluation of repair activity by quantification of ribonucleotides in the genome

Abstract Ribonucleotides incorporated in the genome are a source of endogenous DNA damage and also serve as signals for repair. Although recent advances of ribonucleotide detection by sequencing, the balance between incorporation and repair of ribonucleotides has not been elucidated. Here, we describe a competitive sequencing method, Ribonucleotide Scanning Quantification sequencing (RiSQ‐seq), which enables absolute quantification of misincorporated ribonucleotides throughout the genome by background normalization and standard adjustment within a single sample. RiSQ‐seq analysis of cells harboring wild‐type DNA polymerases revealed that ribonucleotides were incorporated nonuniformly in the genome with a 3′‐shifted distribution and preference for GC sequences. Although ribonucleotide profiles in wild‐type and repair‐deficient mutant strains showed a similar pattern, direct comparison of distinct ribonucleotide levels in the strains by RiSQ‐seq enabled evaluation of ribonucleotide excision repair activity at base resolution and revealed the strand bias of repair. The distinct preferences of ribonucleotide incorporation and repair create vulnerable regions associated with indel hotspots, suggesting that repair at sites of ribonucleotide misincorporation serves to maintain genome integrity and that RiSQ‐seq can provide an estimate of indel risk.

, each top panel shows annotations: boxes indicate genes on the Watson strand (magenta) or Crick strand (cyan) and replication origins (black), and triangles indicate centromeres. Each middle panel shows a line plot of rNMP profiles: Watson strand profiles are depicted as red (wild type) and orange (rnh201 ) lines, and Crick strand profiles are depicted as blue (wild type) and green (rnh201 ) lines. For chromosomes, 42-base bin data are plotted, and, for rDNA repeat unit, 21-base bin data are plotted. In the Ty1 rNMP profiles, rNMP values in 21-base bins are calculated from aligned Ty1 elements by Clustal W. Each bottom panel indicates GC-content profile and chromosomal position. All data were obtained from cells cultured in YPD. Regions with mapping bias that were not included in the quantification analysis are covered with gray boxes. (B) Scatter plot of read counts in chromosomal strands: PCR-amplified rNMP reads vs PCR-free rNMP reads (wild-type replicate 1 of YPD-cultured cells). Linear regression model without intercept is shown. Open circles indicate data not used for regression analysis. Figure S3. Nucleotide composition of genome and rNMPs. (A) Nuclear and mitochondrial nucleotide composition is shown. (B) Difference between leading and lagging strands is shown. Significance of differences was evaluated by two-sided Chi-squared test (N=500). "ns" indicates not significant (p 0.05). Average composition was analyzed in PCR-free data obtained from YPD-cultured cells.  * For STD samples, the number of Read2 (reverse read) is shown.

Supplemental
** For STD samples, the number of mapped Read2 (reverse read) is shown. † Correction factor of PCR reads for PCR-free scale adjustment. § Values at background read sample indicate correction factor of background 3'-shifted end for control analysis. ‡ R standard in Figure 1A. Correction factors from rNMP detection efficiency of spiked-in standerds for RNase H treated sample.

Alkali-denaturing agarose gel analysis
Alkaline digestion and alkaline agarose electrophoresis were performed with 2 µg of genomic DNA on 0.8% agarose gels as previously described (Nick McElhinny et al., 2010). After electrophoresis, gels were stained with GelRed (Biotium) for 10 hours, and images were captured on a After AMPure XP bead purification (1:1.5), RNase H-treated DNAs were separated on alkaline agarose gels as described above and capillary-transferred to Hybond N + membranes prior to depurination (0.1 N, HCl for 20 min) and neutralization [1 M Tris-Cl (pH 8.1), 1.5 M NaCl for 30 min]. Southern hybridization was performed using 32 P-α-dCTP random-primed labeled PCR fragments as probes (Chr. III: 206,369 and Chr. XII: 848,112). Southern blot images were captured on a FLA-7000 (GE Healthcare), and lane profiles were analyzed using Image Quant TL. In Figure S2E and S2F, the input level of Southern blot image was adjusted to 25% in a raw tiff file by Adobe Photoshop CS6.
27. Load into illumina sequencer as PCR-free library.

Load into illumina sequencer as PCR-amplified library.
Data processing

Sequence read mapping and coverage counting
Reference sequences of S. cerevisiae sequences for read mapping were from the ensemble build (EF4.68) of the S288C reference genome, and chromosome numbering was changed from Roman numerals to Arabic numerals. For Ty1 retrotransposon analysis in the S288C genome, 31 Ty1 retrotransposons with 1 kb franking sequences were used for mapping. Ty1 consensus sequence was from CLUSTALW alignments (Larkin et al., 2007) of the 31 Ty1 sequences. For mapping to rDNA repeat unit sequence (chr12:458,433-467,569) and mitochondrial genome sequence (chrmt:1-85,779), rDNA and chrmt sequences were shift to 5 kb and 10 kb downstream, respectively, to compensate mapping bias around sequence ends, and both original-and shiftedreferences were used for mapping. In STD fragment analysis, ΦX174 based synthetic fragment sequences with or without SNPs were used as reference sequences.
Read mapping was performed by bowtie 1.0.0 (Langmead, Trapnell, Pop, & Salzberg, 2009) onto reference sequences: paired-end read mapping (-S --fr -k 1 -n 3 -X 1000) for genomic reads, and single-read mapping of read2 set (-S --best -k 1 -n 0) for STD reads. After selecting properly mapped pairs by SAMtools 0.1.19 (Li et al., 2009) , the each pair of genomic reads in SAM files was directly converted to single forward (read1 direction) strand fragment (both background and rNMP reads) and rNMP position (immediately 3' of rNMP reads) in BED files by AWK scripts. SAM data of STD reads were directly converted to read count data of each STD fragments with SNPs and mapped-end position by AWK scripts. Coverage of mapped fragments and rNMPs onto single-base bin files in BED format was computed by bedtools 2.24.0 (Quinlan & Hall, 2010) (coverageBed -s). These coverage data were mainly used for rNMP accumulation analysis in following processes.

STD fragment analysis
Input rNMP rates were estimated from primary library sequencing by SNP detection of STD fragments: rXMP input rate (X=A,C,G,U) = rXMPfragment / (rNMPf ragment (N=A&C or G&U) + STDbackgr ound) RiSQ-seq rNMP rates were estimated from read counts mapped to proper ends of STD fragments: rXMP RiSQ-seq rate (X=A,C,G,U) = rXMPrXMPend / (rNMPrNMPend (N=A&C or G&U) + rNMPf ull-end + STDful l-end) Recovery rates were calculated by normalization of RiSQ-seq rNMP rate with rNMP rate: rXMP recovery rate (X=A,C,G,U) = rXMP RiSQ-seq rate / rXMP input rate rNMP recovery rates were mean average of rXMP(X=A, C,G,U RNase H digestion in background reads were normalized by total STD background reads as undigested rates. Next, RNase H digestion rates (rXMP digestion rate (X=A,C,G,U)) were estimated by subtraction of undigested rates from input rNMP (rXMP input rate) rates. RNase H digestion efficiency, RDG was estimated by linear regression model by R [lm(rXMP input rate (X=A,C,G,U) ~ rXMP digestion rate +0)]. Finally, Rst andard mock was calculated by normalization of Rst andard by RDG: Rst andard mock = Rst andard / RDG

PCR library adjustment
For scale correction of PCR-amplified rNMP data, rNMP coverage onto full-length chromosome bins was computed and analyzed by linear regression analysis without intercept. Rpcr was estimated by R function: lm(PCR_rNMPchr .strands~ PCR-free_rNMPchr .strands +0). For control data, immediately 3' positions of all properly mapped fragments (total background end: Bg-end) in RNase H untreated samples were adjusted to scale of secondary-adaptor reads (SA-end) as random targets of secondary adaptor ligation. Correction coefficients for control data Rcont rol were also estimated by R function: lm(Bg-end chr.strands ~ SA-end chr.strands +0).

Estimation of rNMP accumulation
rNMP accumulation was estimated from read coverage of rNMP and background at single base bins on each strand. Except a single base meta analysis of mitochondrial rNMP ( Figure S4F), all rNMP accumulation values were analyzed with 42-(whole genome) or 21-base (Ty1 and rDNA units) bins that are consist of 7 trios of base blocks and sliding with 14 or 7 base step (see below). Mean value of the 7 trios was accounted to be rNMP accumulation level at the center 14 or 7 base bin. These trios in each bin allowed providing nonoverlapped data sets and comparing different samples.

Extraction of biased regions
Since multiple copy sequences tended to show biased coverage of background reads, this study excluded such regions from rNMP accumulation analysis. Precise profiles of rDNA repeat unit and Ty1 retrotransposons were separately analyzed at 21 base bin level. To extract biased region, background reads (YPD, background reads in WT and rnh201∆ samples) were mapped onto reference sequences by paired-end and single-end read modes by bowtie 1.0.0 (paired-end mode: --fr -S -k 1 -n 3 -X 1000, single-end mode: -k 1 -S -n 3 --best). After conversion of mapping data to coverage data of 252 base bins with 14 base sliding step by SAMtools and bedtools, regions in which coverage rate between paired-and single-end mapping was over 1.5 were selected from both Watson and Crick strand and merged by bedtools (mergeBed). Biased clusters over 1 kb, telomere proximal regions (0.5 kb from telomere end), rDNA region and 5S rDNAs were combined as biased regions.

Read distribution analysis
Using PCR-amplified rNMP coverage data on single base bins, total rNMP counts on nuclear genome without biased regions were calculated for normalization. rNMP PCR coverage data on 42-base bins with 14-base sliding step were normalized by total rNMP counts and bin size of each sample (reads per million / base: rpm / base).

GC content estimation and peak selection
GC content was calculated from 126 base window sliding with 14 base step (bed file) that associated with 14 base bin at the center of the window by bedtools and reference sequences fasta (nucbed -fi reference.fasta -bed 126-base_window.bed). GC content of genes, nucleosomes, and other segments were calculated from mean GC content of associated bins. GC-(GC>0.5) and AT-peaks (GC<0.2) were selected as peak center bins in successive 5 windows with enough length of intervals: the interval of GC-peaks was over 250 bases, and 500 bases for AT-peaks. The replication direction of each peak was classified by Okazaki fragment rate described below.

Parameter for replication direction
For evaluate rate of replication direction at each genomic region, fragment coverage of Okazaki fragment sequencing data in previously study was used (Smith & Whitehouse, 2012) Paired-end reads data (WT_sample data: SRR364781) were mapped onto reference sequences by bowtie 1.0.0 (-S --fr -k 1 -n 3 -X 1000), the resultant mapping data (SAM) were converted to fragment data (BED), and coverage of Okazaki fragments (OF) at single-base bins of Watson and Crick strands were computed by bedtools (-s coverageBed). Total coverage of OF at 238 base windows on Watson (OF Watson) and Crick (OF Crick) strands were converted to OF rate (OF rate) data for 14 base bins at the center of the windows: OF %&'( = OF K&'8E9 (OF K&'8E9 + OF 2%*DH ) OF rate>0.5 indicates left to right replication major and OF rate<0.5 indicates right to left replication.
OF bias was also used for selection of unidirectional replicated regions: OF L*&8 = max (OF %&'( , 1 − OF %&'( ) The OF rate of genes, nucleosomes, and other segments were calculated from mean OF rate of associated bins.

Simple repeats calling
Homo-polymers, dinucleotide and triplet repeats in S. cerevisiae genome were searched by non-B DB v2.0 (Cer et al., 2013) and classified by repeat types for analysis.

RTM peaks collection
Peaks and valleys of RTM were collected from RTM lists as maximums and minimums in 5 successive 42base bins.
RMT thresholds of accuracy over 0.99 (False positive rate <0.01) were estimated from threshold lists. Regions over 0.99 accuracy thresholds were extracted from RTM lists with 14 base bins (BED) and merged them by bedtools (mergeBed).