epiGBS2: Improvements and evaluation of highly multiplexed, epiGBS‐based reduced representation bisulfite sequencing

Abstract Several reduced‐representation bisulfite sequencing methods have been developed in recent years to determine cytosine methylation de novo in nonmodel species. Here, we present epiGBS2, a laboratory protocol based on epiGBS with a revised and user‐friendly bioinformatics pipeline for a wide range of species with or without a reference genome. epiGBS2 is cost‐ and time‐efficient and the computational workflow is designed in a user‐friendly and reproducible manner. The library protocol allows a flexible choice of restriction enzymes and a double digest. The bioinformatics pipeline was integrated in the snakemake workflow management system, which makes the pipeline easy to execute and modular, and parameter settings for important computational steps flexible. We implemented bismark for alignment and methylation analysis and we preprocessed alignment files by double masking to enable single nucleotide polymorphism calling with freebayes (epifreebayes). The performance of several critical steps in epiGBS2 was evaluated against baseline data sets from Arabidopsis thaliana and great tit (Parus major), which confirmed its overall good performance. We provide a detailed description of the laboratory protocol and an extensive manual of the bioinformatics pipeline, which is publicly accessible on github (https://github.com/nioo‐knaw/epiGBS2) and zenodo (https://doi.org/10.5281/zenodo.4764652).

line was integrated in the Snakemake workflow management system, which makes the pipeline easy to execute and modular, and parameter settings for important computational steps flexible. We implemented biSmark for alignment and methylation analysis and we preprocessed alignment files by double masking to enable single nucleotide polymorphism calling with FreebayeS (epiFreebayeS). The performance of several critical steps in epiGBS2 was evaluated against baseline data sets from Arabidopsis thaliana and great tit (Parus major), which confirmed its overall good performance. We provide a detailed description of the laboratory protocol and an extensive manual of the

| INTRODUC TI ON
Cytosine methylation at carbon position 5 (also termed 5-meC) is a chemical epigenetic modification of DNA. This modification can influence gene activity and expression and has the potential to affect transcription regulation (Zhang et al., 2018). Genome-wide 5-meC discovery is routinely performed using methods based on bisulfite treatment followed by high-throughput sequencing (BS-Seq) (Reyna-López et al., 1997). Whole genome BS-Seq (WGBS) (Suzuki et al., 2018) is the gold standard if financial resources and a reference genome are available, which is still not the case for the majority of organisms. While the popularity of BS-Seq studies is growing ( Figure S1), data are mainly generated for model species such as mouse, human and Arabidopsis thaliana, representing 46%, 34% and 4% of all BS-Seq data sets in the SRA (Table S1), respectively.
A less comprehensive but cheaper and versatile alternative to WGBS is BS-Seq in reduced representations of the genome, by using restriction enzyme fragmentation during the library preparation, such as RRBS (Meissner et al., 2005), epiGBS (van Gurp et al., 2016), BsRADseq (Trucchi et al., 2016), epiRADseq (Schield et al., 2016) and Creepi (Werner et al., 2020). Several easy-to-use bioinformatics tools and workflows have been developed to analyse BS-Seq data, such as BS-Seeker2 (Guo et al., 2013), biSmark (Krueger & Andrews, 2011) and baT (Kretzmer et al., 2017), which assume availability of a reference sequence for read mapping and methylation calling.
However, there is increasing interest in studying DNA methylation in nonmodel study species, for instance to understand the involvement of DNA methylation in ecological and evolutionary processes. Such methods have to deal with the absence of reference genomes, the complex genomes of nonmodel organisms and high sample numbers, and have to accommodate a simultaneous comparison of genetic and epigenetic data, for instance to examine how much of the overall epigenetic variation between samples can be predicted from pairwise genetic relatedness (Richards et al., 2017).
In a previous publication, we presented epiGBS as a reducedrepresentation DNA methylation analysis tool that combines those features (van Gurp et al., 2016). epiGBS calls both cytosine-specific quantitative DNA methylation levels and single nucleotide polymorphisms (SNPs) from the same bisulfite-converted samples, based on reconstructing the de novo consensus sequence of the targeted genomic loci. This means that the method can be applied also when no reference genome is available for the species under study (van Gurp et al., 2016). A similar approach was described by Werner et al. (2020), who provide a proof-of-concept via data obtained from almond and a PstI single enzyme digest. Here, we present epiGBS2, which consists of a detailed, updated laboratory protocol and a revised computational analysis pipeline that is accessible for all with basic knowledge in bioinformatics, and we subject the method to several performance tests. Executing epiGBS2 is cost-and timeefficient and is designed for user-friendly, reproducible and flexible analysis, allowing for an effective determination of methylation and SNP variants in a broad range of species, including plants and vertebrates such as birds (Sepers et al., 2019). We evaluate the performance of epiGBS2 by comparing the analysis results of A. thaliana accessions and great tit (Parus major) samples to published benchmarking sets.

| Construction of epiGBS libraries
In order to reduce sequencing bias and costs, several major improvements were made to the original epiGBS laboratory protocol, the majority of which were described recently by Boquete et al. (2020).
We briefly list the key improvements here. In addition, we present a detailed description of the adapter design, which allows flexible choice of the restriction enzyme pair, and a detailed step-by-step protocol of the library preparation in the Supporting Information.
The resulting epiGBS2 library is paired-end and directional, while additionally the information about the origin of the read is labelled by a control nucleotide. Reads originate either from the original top strand (Watson), the complementary Watson strand, the original bottom strand (Crick) or the complementary Crick strand.

| Identification of PCR duplicates
During the preparation of sequencing libraries, PCR clones can be produced. Removing these PCR duplicates computationally avoids overrepresented fragments caused by biased duplication, which allows for more accurate interpretation of results. Using common whole-genome sequencing laboratory protocols, sequence identity is a basis for identifying PCR duplicates. However, in reduced-representation approaches that use amplification of restriction enzyme-associated DNA, fragments of identical sequence bioinformatics pipeline, which is publicly accessible on github (https://github.com/ nioo-knaw/epiGBS2) and zenodo (https://doi.org/10.5281/zenodo.4764652). . This CN is an unmethylated cytosine, which is placed after the barcode followed by the sequence of the RE overhang ( Figure 1) and used for Watson/Crick annotation of the reads (Figure 2). Read pairs with T at the CN position of the R1-/ adapter BA-read and C at the CN position of the R2-/ adapter CO-read are defined as Watson; read pairs with C at the CN position of the R1-/ adapter BA-read and T at the CN position of the R2-/ adapter CO-read are defined as Crick. This design facilitates the use of various RE combinations and makes the epiGBS2 protocol more universally applicable. While the epiGBS protocol was originally optimized for plants, the freedom to also use other enzymes, such as for example MspI, makes epiGBS2 now also very effective for studies on other organisms, such as vertebrates.

| Use of hemimethylated adapters
To reduce costs, the library preparation protocol was adjusted in such a way that hemimethylated adapter pairs are used instead of fully methylated adapters. In epiGBS2 the cytosines of the oligonucleotides adapter BA-I and adapter CO-I are 5-C methylated ( Figure 1, and see laboratory protocol in the Supporting Information). The oligonucleotides of the opposite strands (adapter BA-II and adapter CO-II) contain unmethylated cytosines only and are 5′-dephosphorylated.
After annealing the respective BA-I and BA-II and CO-I and CO-II adapter oligonucleotides and ligating them with the enzyme-digested DNA fragment, only adapter 3′ ends and fragment 5′ ends ligate. A nick remains between adapter 5′ ends and fragment 3′ ends. The nick is repaired by using dNTPs that contain 5-meC's and that directly translate all 5′-3′ nucleotides starting from the nick. This results in fully methylated adapters that are ligated to the digested DNA fragment and a complementary 3-nucleotide short UMI sequence.

| Miscellaneous optimizations
As with reduced-representation sequencing (e.g., genotyping-bysequencing) of unconverted DNA, the optimal adapter concentration can vary depending on the frequency of enzyme cut sites in the DNA and the genome size (Wallace & Mitchell, 2017). We therefore advise to ascertain the optimal adapter concentration for each new study system and restriction enzyme or enzyme combination by methods as suggested by Wallace and Mitchell (2017).
F I G U R E 1 epiGBS2 uses hemimethylated adapters. epiGBS2 adapters consist of the Illumina adapter sequence (grey), a random 3 nucleotide sequence called UMI (yellow), a barcode (orange), a control nucleotide (blue) and the complement restriction enzyme site overhang sequence (grey). The 5′-3′ strand of the BA adapter and the 3′-5′ strand of the CO adapters contain only methylated cytosines (X); the opposite strands are unmethylated. All strands are dephosphorylated, so only adapter 3′ ends and DNA fragment 5′ ends ligate. During nick-translation the "broken" strands are replaced with 5mC-dNTPs, which results in fully methylated adapters gracefully if any step fails (Perkel, 2019). WMS also integrate with package managers such as conda (Grüning et al., 2018) and docker (Merkel, 2014), which install software dependencies automatically, and allow flexible integration with resource management systems such as slurm. In combination, WMS and package managers make the installation and execution of analysis pipelines accessible for biologists who have a basic knowledge in bioinformatics. The main computational steps of epiGBS2 are PCR clone removal, demultiplexing and Watson/Crick strand annotation, read quality control, adapter trimming, mapping to a reference (de novo or pre-existing), methylation calling and SNP calling. We embedded these steps in a Snakemake (version 6.1.1) workflow ( Figure 3) and divided them into specific rules that make the workflow modular and allow executing, exchanging or updating specific parts of the pipeline without modifying the others.
To simplify software installation and updating, we created conda (Grüning et al., 2018) environments in such a way that the workflow is portable and independent from the used Linux system. Each Snakemake rule calls specific conda environment files and automatically installs the required dependencies. Custom scripts were written in Python 3.
2.2.2 | Description of the Snakemake branches epiGBS2 runs in two modes, which the user can select from the Snakemake configuration file: either with a pre-existing reference genome or in de novo mode. These two branches are nearly identical in all Snakemake rules, except that in de novo mode the reference of the fragments under study is reconstructed from the epiGBS2 reads themselves (van Gurp et al., 2016) ( Figure S2). The reference mode was added to the workflow to facilitate analysis in species for which a reference genome is available. Optionally, epiGBS2 can also be run in a third "legacy" mode. This mode runs a version of epiGBS2 that was described in Gawehns et al. (2020)  First, PCR clones are removed, which are identified by identical F I G U R E 2 A control nucleotide (CN) is used for strand annotation. To allow a flexible choice in restriction enzymes but still be able to differentiate between original Watson (orange) and Crick (grey) strands, a control nucleotide (blue) was added to the epiGBS2 adapter, which consists of a single unmethylated cytosine in the BA-I and the CO-I adapters. During bisulfite treatment the cytosines are converted to thymines. After amplification the new bottom strand of the original Watson contains an adenine in the BA adapter but still a cytosine in the CO adapter; the new top strand of the original Crick contains a cytosine in the BA adapter and an adenine in the CO adapter. Subsequently, the R1 read of the Watson strand will contain a thymine at the control nucleotide position and a cytosine in the R2 read. R1 reads with a cytosine and a thymine in the R2 read will be annotated as Crick

| De novo reference creation
De novo reference creation is performed as described in van Gurp et al. (2016) and in Figure S2. To implement this in the de novo branch of the Snakemake workflow, the Watson and Crick R1 and R2 read files, which were created by merging the untrimmed reads of all samples after demultiplexing, are used as input. During R1-and R2-read assembly with pear, 3′ end adapter sequences are removed.
To reconstruct the consensus reference sequence, three clustering steps are performed: (i) deduplication of three-letter encoded Watson and Crick reads; (ii) pairing of binary Watson and Crick reads; and (iii) clustering of reconstructed reference clusters by identity. Finally, the created de novo reference is prepared for the alignment with biSmark by adding four N's at the beginning and end of the clusters to ensure that biSmark is able to align the reads to the de novo reference.

F I G U R E 3
Overview of the epiGBS2 pipeline. The main Snakemake rules are visualized. Boxes represent steps with modular output that can be executed individually. Orange: read preprocessing. PCR clones are removed from all input reads, and samples are demultiplexed with STackS2 and annotated as either Watson or Crick reads. These reads are adapter-trimmed with cuTadapT. Purple: in the de novo branch reads are either assembled with pear or joined with a custom script. These sequences are deduplicated, and Watson and Crick reads are paired and clustered based on identity. The minimum cluster size during deduplication and the identity percentage are introduced as variable parameters and can be set in the config file. Green: trimmed reads are aligned to the reference (either de novo clusters or pre-existing reference) with biSmark. Blue: methylation is also called with biSmark. Grey: alignment files are preprocessed by double masking to enable SNP calling with FreebayeS. Brown: the processed reads (trimmed and untrimmed) are analysed in a read quality control using FaSTqc and summarized with the log files of all other crucial steps in a mulTiqc summary report To allow adjustment of the de novo reference creation, we added the possibility to vary two de novo creation parameters directly from the Snakemake configuration ( Figure S2): The performance of the first deduplication step can be customized by setting a minimum and maximum cluster depth, and the last clustering step can be customized by setting the identity percentage for clustering. Fine-tuning these parameters may optimize de novo reference construction depending on species-specific genome characteristics.

| Adapter trimming
To prepare the demultiplexed and strand-annotated reads for the alignment step, they have to be adapter-trimmed at the 3′ end, because the majority of the deduplicated and demultiplexed paired-end reads are longer than the DNA fragment length. Therefore, reads are extended into the adapter sequence at the 3′ end of the fragment

| Parus major samples and description of benchmarking data
To analyse the performance of the epiGBS2 reference branch in a vertebrate species, four individual P. major samples from an epiGBS2 library were compared to reduced representation bisulfite sequencing (RRBS) data (Meissner et al., 2005) from the exact same samples.
These four individuals were part of a former RRBS study (Sepers et al., 2021). For detailed methods on sample collection and DNA isolation see Sepers et al. (2021). The epiGBS2 library preparation was conducted following the laboratory protocol described in the  (Laine et al., 2016). were excluded. The destrand option was set to TRUE to combine C's in the same CpG site (CpG dinucleotides). The performance of the epiGBS2 data was evaluated with correlation plots that were created in R (version 4.0.1) (R Core Team, 2021), using ggploT2 version 3.3.3 (Wickham, 2016). The coefficient of determination (R 2 ) was calculated using the ggpmisc package version 0.3.9 (Aphalo, 2021).

| Evaluation of epiGBS2 de novo reference creation
To analyse the performance of epiGBS2 de novo reference creation, the clusters generated by the epiGBS2 de novo branch analysis of A. thaliana samples were aligned to the TAIR_10 reference genome using minimap2 (2.17-r941) (Li, 2018) with the alignment criteria as set in the genomic short read preset (-x sr). The de novo reference clusters were generated based on pooled data from all six accessions used in the sequencing library. The mapping percentage was used to determine the quality of the de novo reference creation. To match the epiGBS2 de novo reference with the TAIR_10 reference, as necessary for subsequent comparison of cytosine methylation and SNP calls, the nucleotide positions of the de novo clusters were adjusted using a custom python script (src/lift_over.py).

| Performance testing of epiGBS2 methylation calling
When comparing methylation calls from the epiGBS2 study to available epigenomes in the 1001 epigenome project, effects of stochastic methylation differences between individuals were minimized by summarizing methylation and nonmethylation calls of individual samples as an average per accession. We compared methylation calls at cytosines that had a minimum coverage of 10× in the 1001 epigenome reference data and a minimum average coverage of 10× per individual sample in the epiGBS2 data (i.e., a total of 10× the number of individuals included for that accession). For the comparison between the epiGBS2 de novo and reference branches, methylation calls per sample were not averaged to generate an accession-level estimate and a 10× coverage threshold per sample was applied.

| Performance testing of epiGBS2 SNP calling
To test the SNP calling performance, all variants that were not SNPs

| Short description of execution of epiGBS2 pipeline
After executing a paired-end next generation sequencing run, the sequencing reads should be 5′-adapter trimmed as executed by most sequencing agencies, but custom parts (UMI, barcode, control nucleotide and restriction site overhang) should remain. The reads of individual samples are multiplexed, so two input files in fastq format will be received for the bioinformatics workflow: Read 1 (forward reads, usually indicated by "R1" in the file name) and Read 2 (reverse reads, usually indicated by "R2" in the file name 6. Check the status of the pipeline regularly for errors.

After the process has finished or an error occurred, inspect the
Snakemake output and the pipeline report.
8. Check output files as described in the documentation.
All predicted methylation sites are reported per sample in multiple output formats such as the cytosine report which describes the context, number of methylated and unmethylated calls, and methylation percentage per site. This format can be used as input in common downstream analysis packages such as meThylkiT (Akalin et al., 2012) or dSS. The SNP calls are summarized in a single multicohort vcf file, which can be used to perform downstream genetic analysis, such as genetic map construction, population genomics or phylogenetics.

| 3′ End adapter removal
To evaluate the performance of 3′ end Illumina and custom adapter removal as described in the Material and Methods section, the read adapter content of sample Cvi-0_11 was determined by FaSTqc.
Directly after demultiplexing and strand annotation, up to 18% of reads contained Illumina adapter sequences at the 3′ end ( Figure   S3). After the additional trimming and filtering step with cuTadapT, these sequences were removed. Consequently, average read size decreased from 140 to 130 bp for R1 reads and from 142 to 129 bp for R2 reads, but most reads (99.99%) were retained. The amount of reads with a high per-sequence base quality increased, the percentage of overrepresented sequences decreased, and no increase of cytosines (R1 reads) and guanines (R2 reads) was observed at the end of the reads after the trimming was performed.

| Creating de novo reference sequences from the epiGBS2 reads
To investigate the performance of the epiGBS2 de novo reference creation, the epiGBS2-generated reference clusters were mapped against the A. thaliana reference genome using minimap2 (Li, 2018). In total, 87.69% of the 19,766 de novo clusters uniquely aligned to the reference; 10,960 of these clusters aligned without any mismatches, and the mean number of mismatches between the TAIR_10 genome and the de novo reference was 1.1, indicating that the sequences of most of them were reconstructed successfully. It also demonstrates that the large majority of de novo clusters do not derive from contaminating DNA, such as endosymbionts. When mapping demultiplexed and trimmed epiGBS2 reads to the TAIR_10 genome using the reference branch, mapping percentages ranged between 50% and 70% depending on accession, with Col-0 performing best. The de novo branch performed similarly, with mapping percentages between 48% and 57%. This again indicates overall good performance of the de novo reference construction.
Next, we investigated the effect of the de novo cluster creation parameters (minimal depth of the first clustering step and clustering identity) on critical quality values of the clustering steps. When increasing the minimal cluster depth, the number of final de novo clusters decreases, while by increasing the identity percentage more clusters are created (Figure 5a). At high minimal cluster depth, and hence lower cluster numbers, the mapping percentage against the de novo reference increased (Figure 5b) because more reads could be mapped uniquely ( Figure S4). At lower minimal cluster depth, higher clustering identities result in higher mapping percentages; however, when the minimal cluster depth is increased, higher clustering identity does not lead to higher mapping percentage. These clustering parameter settings, by affecting the percentage of uniquely mapped reads, also affect average sequencing coverage per site, which is highest when the minimal cluster depth is higher and the identity percentage is lower (Figure 5c). Coverage, in turn, influences the methylation calls ( Figure S4). In conclusion, and similar to other de novo reference-based methods, we recommend exploration of the de novo reference creation parameters when working with new species to ensure optimal performance (Paris et al., 2017).

| epiGBS2 methylation calling
A. thaliana cytosine methylation was called from reference and de novo branch alignments. In total, 1,047,597 and 1,827,467 sites were obtained in the de novo and reference branch, respectively, of which 928,070 sites overlapped. The methylation calls of all sites were plotted against the calls of the 1001 epigenome project (baseline data). For both branches and all tested accessions, the epiGBS2 methylation calls in CG context were strongly correlated with the baseline methylation calls (R 2 > .93) ( Figure 6). In CHG context, the correlation was slightly lower (reference branch R 2 > .79, de novo branch R 2 > .8). For CHH context, very low correlations were found between the epiGBS2 calls and the baseline data set. Methylation in CG context is known to show high levels of transgenerational stability (Graaf et al., 2015), and therefore high similarities in CG methylation between individuals from the same shows that epiGBS2 methylation calling performs well. CHH methylation, on the other hand, is much more variable among conditions (Dubin et al., 2015) and hence a much lower correlation is expected, which may reflect biological difference rather than technical differences of the methylation calling procedure. Methylation calling of the legacy branch produced near-identical results to the epiGBS2 de novo and reference branches in all three methylation contexts ( Figure S5).

| epiGBS2 SNP calling
To evaluate the SNP calling, a comparison of filtered and unfiltered SNP calls between the epiGBS2 analysis and the baseline set was performed with Rtg vcfeval. The precision and sensitivity of epiGBS2 SNP calling were determined as a function of variant quality (QUAL), which is calculated by the variant caller (FreebayeS) and a phred-based metric describing the probability that a certain position contains a variant. Selecting for SNPs with higher quality scores is expected to lead to higher precision (more of the epiGBS2 SNP calls are correct) but lower sensitivity (because more sites, and thus also more true SNPs, are excluded from evaluation because they do not meet the quality filter threshold). At high SNP quality thresholds, more than 90% of the SNPs identified in the epiGBS2 reference branch corresponded with the baseline data set (Figure 7).
At lower quality values a larger total of true baseline SNPs were detected by epiGBS2, but this also resulted in an increased false positive rate (for instance, only half of the called SNPs are true positives at a quality value of 10; Figure 7). Because RRBS methods are prone to strand bias but SNPs in cytosine context can only be called using opposite-strand alignment, a subset of true baseline SNPs remain undetectable and limit the level of sensitivity that can be maximally achieved in the unfiltered data. When we excluded sites that were not covered at both strands from the epiGBS2-baseline comparison (the filtered data set), sensitivity increased: ~80% of the remaining SNPs in the baseline were detected by epiGBS2 (Figure 7). Because heterozygous genotypes were not called in the baseline set (1001Genome Consortium, 2016, we ignored the genotype calls during the SNP benchmarking by treating the genomes as haploid. Including the evaluation of correct genotype calling in the benchmarking, in addition to the correct allele, the overall level of precision was reduced in both filtered and unfiltered subsets ( Figure S6). Further investigation revealed this difference to be driven mainly by a fraction of true homozygous SNPs that were misidentified as heterozygous under epiGBS2. A similar excess of heterozygous SNP calls from WGS and simulated WGBS data from A. thaliana Cvi-0, compared to the 1001 genomes benchmark data, was observed in Nunn et al.
(2021) when using FreebayeS. This suggests that the FreebayeS tool, rather than inferring SNPs from bisulfite data, is causing the small discrepancy with the benchmark data.
To illustrate that epiGBS2 produces DNA methylation and SNP calls that differentiate the used A. thaliana accessions, we determined the relatedness between the samples. This was performed by calculating correlation and Euclidean distances for methylation F I G U R E 5 Effects of clustering parameter settings on de novo reference creation of Arabidopsis. thaliana. To evaluate the influence of clustering parameters on the de novo reference, the effects of minimal cluster depth of the first clustering step (x-axis: 10, 50, 100) and the last clustering step (cluster identity: black 95%, orange 97%, blue 99%) on the number of created de novo reference clusters (a), the percentage of read mapping against the de novo reference (b) and the average read depth per site after mapping (c) were determined and SNP calls, respectively, followed by clustering using the Ward method. Based on methylation data (Figure 8a,b) and on SNP calls ( Figure 8c, (Figure 9). Hence, due to the use of different REs, largely different parts of the genome were sampled, but within the existing overlap, methylation levels were very similar.

| DISCUSS ION
We presented here epiGBS2, consisting of a detailed description of our current laboratory protocol and a bioinformatics workflow for epiGBS analysis. epiGBS2 includes several major updates compared to the method as first published by van Gurp et al. (2016), a performance test and a user-friendly computational analysis pipeline aimed at the researcher with some basic experience in bioinformatics, such as working in a Linux environment. The updates were made in part to improve the method but also to make the method available in a more user-friendly and reproducible way to users working on a wide range of organisms. To date, published papers using epiGBS have been mainly limited to plant species (e.g. Alvarez et al., 2020;Moorsel et al., 2019;Mouginot et al., 2021;Mounger et al., 2021;Prudencio et al., 2018), but also include studies from vertebrates (Meröndun et al., 2019) and molluscs (Johnson & Kelly, 2020

| Validation of epiGBS2 results
Evaluation of the epiGBS2 pipeline performance was done by performing epiGBS2 analysis in several Arabidopsis thaliana benchmarking accessions, for which detailed SNP and DNA methylation information was available from previous whole-genome sequencing and BS-Seq studies. Comparing epiGBS2 results to these baseline data provided strong support that cytosine methylation levels are identified correctly in both CG and CHG contexts, in both the de novo and reference branches of epiGBS2. In CHH context, however, epiGBS2-based DNA methylation results did not correlate well with the benchmarking data. In plant genomes, DNA methylation in CHH context is known to be much more dynamic and environmentdependent than methylation in other cytosine contexts (Dubin et al., 2015). Thus, we interpret the lack of a good correlation in CHH context as probably reflecting real differences in CHH methylation between the samples that were used in our study and the samples that were used in the 1001 epigenomes project, our baseline data. Validation of epiGBS2 results was also conducted by comparing epiGBS2 and RRBS data from four great tit samples. Despite a relatively low number of shared sites for this specific comparison, comparing epiGBS2 to the RRBS data supports that cytosine methylation levels in CG context are correctly identified. We expect the number of shared sites and the correlation to be even higher if the same restriction enzymes were used and if the computational analysis protocol had been identical. However, the use of an MspI restriction digest without NsiI as usually done in RRBS has to be evaluated in future studies, as we did not investigate the performance of epiGBS2 for enzymes with symmetric cut-sites. Nonetheless, the observed correlation between the P. major epiGBS2 samples and the RRBS data shows that epiGBS2 methylation calling performs well in a vertebrate species.

| Limitations and future work
Making epiGBS2 available allows others to use the updated methodology. However, we note that the development of epiGBS2 is a continuous process. Some known but minor shortcomings of the bioinformatics procedures are as follows. For instance, during demultiplexing the presence of the expected RE overhangs is validated. This validation accepts one nucleotide mismatch to allow recognition of C-to-T converted RE overhang sequences after bisulfite treatment.
If a mismatching nucleotide is identified (e.g., a T instead of a C), it is replaced with a C by the stacks2 code; this can effectively result in an unmethylated cytosine becoming labelled as methylated. One possible solution could be for a script to approve the remaining RE overhang and allow C/T conversions without replacing them.
Another known issue is that we often find that 20%-30% of all sequencing reads cannot be demultiplexed successfully due to ambiguous barcodes. We identified two possible contributors to this issue: (i) in all epiGBS2 data sets that we inspected, the de novo barcodes recognized by STackS2 contain a notable number of sequences that resemble existing barcode combinations but that miss the first nucleotide of the R1 barcode. When inspecting untrimmed reads with the UMI still attached, we observed that the missing barcode nucleotide is typically present but only two UMI nucleotides are identified. This indicates that the issue may be caused by the UMI sequences. UMI sequences were synthesized at random, so it is possible that some low-complexity UMIs are generated that prevent optimal phasing of the sequencing. (ii) In epiGBS2 data sets with high PCR duplication rates, we identified de novo barcode combinations with an existing R2 barcode sequence, which we also found in the R1. However, in this combination, these specific barcode pairs were not expected. A similar issue was noted by Trucchi et al. (2016) in a related protocol, which was attributed to improper annealing during PCR of the forward primer to the CO adapter and which may be remedied by modifying the primer design (Trucchi et al., 2016). Another improvement could be made in achieving uniformity in sequencing output per sample. The epiGBS2 protocol prioritizes efficiency and cost-effectiveness in library preparation; however, in our hands, we observe considerable between-sample variability in read counts after sequencing. This might be improved by conducting size selection of fragments after RE digest for each individual sample before adapter ligation. Alternatively, improvement can be achieved by quantifying DNA amounts per individual sample with a qPCR step after adapter ligation. This makes the library preparation process more elaborate, but may provide more control over individual sample output. Alternatively, the sequencing performance of individual barcodes and barcode combinations could be benchmarked in more detail.

| CON CLUS ION
We have here presented and validated epiGBS2, an updated method and analysis pipeline of previously published epiGBS approaches and showed that it performs well in terms of methylation and SNP calling compared to "gold standards" such as WGBS. The detailed description of the adapter design and the use of a CN allows for flexible use of different restriction enzymes and makes epiGBS2 applicable to a wide range of organisms and applications.
However, before starting with new organisms and RE combinations we highly recommend to (i) estimate the expected complexity reduction by an in silico digestion of the genome of the species of interest or a related species, (ii) optimize the adapter concentrations as described in Wallace and Mitchell (2017) and (iii) run the de novo reference clustering with different parameter settings as those will influence the mappability of the sequencing reads.
Embedding the analysis pipeline in Snakemake makes the execution of the pipeline more accessible to nonbioinformaticians, more user-friendly and reproducible, and allows the user to exchange the tools for mapping, methylation and SNP calling to software of their own choice.

CO N FLI C T O F I NTE R E S T
The authors have no conflict of interest to declare.

O PE N R E S E A RCH BA D G E S
This article has earned an Open Data Badge for making publicly available the digitally-shareable data necessary to reproduce the reported results. The bioinformatics pipeline and documentation can be accessed on github (https://github.com/nioo-knaw/ epiGBS2) and was deposited on zenodo (https://doi.org/10.5281/ zenodo.4764652).

DATA AVA I L A B I L I T Y S TAT E M E N T
The demultiplexed A. thaliana epiGBS data were deposited on NCBI with BioProject ID PRJNA764918 and raw data files are accessible on Zenodo (https://doi.org/10.5281/zenodo.5519370). Great tit data can be accessed on NCBI with BioProject ID PRJNA208335. The RRBS reads are available under the SRA accessions SRX10989860, SRX10989861, SRX10989862 and SRX10989863, and the epiGBS2 reads are available under the SRA accessions SRX10994457, SRX10994458, SRX10994459 and SRX10994460. The epiGBS2 laboratory protocol can be found in the Supporting Information.