“Escalibur”—A practical pipeline for the de novo analysis of nucleotide variation in nonmodel eukaryotes

Abstract The revolution in genomics has enabled large‐scale population genetic investigations of a wide range of organisms, but there has been a relatively limited focus on improving analytical pipelines. To efficiently analyse large data sets, highly integrated and automated software pipelines, which are easy to use, efficient, reliable, reproducible and run in multiple computational environments, are required. A number of software workflows have been developed to handle and process such data sets for population genetic analyses, but effective, specialized pipelines for genetic and statistical analyses of nonmodel organisms are lacking. For most species, resources for variomes (sets of genetic variations found in populations of species) are not available, and/or genome assemblies are often incomplete and fragmented, complicating the selection of the most suitable reference genome when multiple assemblies are available. Additionally, the biological samples used often contain extraneous DNA from sources other than the species under investigation (e.g., microbial contamination), which needs to be removed prior to genetic analyses. For these reasons, we established a new pipeline, called Escalibur, which includes: functionalities, such as data trimming and mapping; selection of a suitable reference genome; removal of contaminating read data; recalibration of base calls; and variant‐calling. Escalibur uses a proven gatk variant caller and workflow description language (WDL), and is, therefore, a highly efficient and scalable pipeline for the genome‐wide identification of nucleotide variation in eukaryotes. This pipeline is available at https://gitlab.unimelb.edu.au/bioscience/escalibur (version 0.3‐beta) and is essentially applicable to any prokaryote or eukaryote.

been a major expansion in genomic and associated proteomic and transcriptomic data sets for socioeconomically important parasitic worms of animals and plants (Kikuchi et al., 2017;Ma et al., 2020;Wit & Gilleard, 2017). This major and rapid progress has meant that the focus is now much more on analyses of massive genome data sets, which demand novel, efficient, reliable and reproducible tools as well as solutions to deal with this "data overload". Although a number of software workflows, such as CloudMap (Minevich et al., 2012), ChurChill (Kelly et al., 2015), togglE (Monat et al., 2015) and SarEk (Garcia et al., 2020), have been developed to handle and process such large data sets and to carry out biological and/or population genetic analyses of them, there has been somewhat of a delay in the creation of effective, specialized workflows and pipelines for genetic and statistical analyses.
Most available workflows use the Genome Analysis Toolkit (gatk) (McKenna et al., 2010) for the identification and recording of nucleotide variation (i.e., "variant calling"). Commonly, the quantification of genetic variation within well-known and well-studied species relies on the use of one or more high-quality reference genomes (NCBI) and variome data sets (e.g., for humans or mice; https:// www.human vario mepro ject.org/) available in public or specialized databases. However, for most species, such extensive resources are not available. Frequently, an incomplete or fragmented genome is available, and sometimes little or nothing is known about a taxon's species-status, its ploidy, whether sibling species exist, and/or whether genomic rearrangements occur within or among populations. Depending on the method used to collect samples, there may also be challenges linked to contamination with extraneous DNAs (e.g., viruses, bacteria and fungi). Thus, for ab initio studies of genetic variation in "lesser studied" eukaryotes, there is a need to ensure that well-defined steps are put in place for the selection of suitable reference sequences (draft or reference quality) and that extraneous/nonspecific ("contaminating") DNA sequences are removed prior to analysis.
For these reasons, we developed here a highly efficient, scalable pipeline for the genome-wide analysis of nucleotide variation in eukaryotes. This pipeline, called Escalibur, has key steps, including nucleotide sequence trimming and mapping; selection of an optimum reference genome; removal of contaminating sequences; calibration; and variant calling. Escalibur version 0.3-beta (available at https:// gitlab.unime lb.edu.au/biosc ience/ escal ibur) is suitable for the analysis of genetic variation among hundreds to thousands of individuals or samples of a eukaryotic species.

| MATERIAL S AND ME THODS
The Escalibur version 0.3-beta pipeline is presented schematically in Figure 1. It integrates the processing steps using the workflow language WDL version 1.0 (https://github.com/openw dl/wdl), with the steps executed employing the CroMwEll Workflow Management System version 5.0 (https://github.com/broad insti tute/cromwell).
CroMwEll scales from stand-alone servers to highly distributed computing environments, and example configuration files for both stand-alone and distributed computing environments are provided.
The WDL-based implementation and distribution of the pipeline follow the findable, accessible, interoperable and reusable (FAIR) principles (Wilkinson et al., 2016), but can also be readily modified and expanded as required by the user. This pipeline uses a singularity container (Kurtzer et al., 2017) to equip 16 programs (Table S1). In the following, we focus on the three key steps for the identification of genome-wide nucleotide variability. Each step can be run in a separate workflow in a sequential manner; step 2 is optional.

| Step 1. Read-filtering and mapping
Raw paired-end (PE) short-read DNA sequence data (e.g., Illumina or BGISEQ) in FASTQ format (Cock et al., 2010) are optionally filtered for quality (default Phred cut-off: 20) and adapter-trimmed using the program triMMoMatiC version 0.39 (Bolger et al., 2014). Readpairs are then mapped to a well-defined (preferably chromosomecontiguous) reference genome sequence using the program Burrows-Wheeler Aligner (bwa) MEM version 0.7.12 (Li & Durbin, 2010), PCR duplicates are marked and the resultant BAM files are merged per sample. If multiple genomes or draft genomes are available for a particular species/taxon, the algorithm identifies the most suited ("optimum") reference genome based on mapping and read coverage averages, providing a sound basis for the identification of nucleotide variations.

| Step 2. Quality assessment, and removal of extraneous (contaminating) sequence reads
The quality statistics of trimmed sequence data are calculated using faStqC version 0.11.9 (https://www.bioin forma tics.babra ham.ac.uk/ proje cts/fastqc). Mapping statistics of trimmed data from each sequence library are calculated using SaMtoolS StatS version 1.10 (Li et al., 2009); subsequently, key mapping metrics, such as the total numbers of mapped and unmapped reads, percentage of paired reads and the total number of bases mapped, are summarized for each library. To remove extraneous, contaminating sequence reads from the aligned sequence data for individual samples, reads that map to the reference genome are mapped to a spectrum of userselected genome sequences available for other organisms using bwa; for example, in the case of a eukaryotic pathogen, reads might be mapped to prokaryotic microbes and the host species. Finally, customized python scripts are used to identify and remove contaminating sequences following the latter mapping phase.

| Step 3. Calibration of data and variant calling
The mapped data are then used to identify single nucleotide polymorphisms (SNPs) at individual positions and insertion/deletion events (indels) in relation to the optimum reference genome sequence using gatk version 4.1.3.0. In brief, base quality scores of "raw," aligned sequence-read data are recalibrated twice based on the nucleotide variations recorded. Subsequently, for each sample sequenced, SNP sites and indels are identified using the gatk HaplotypeCaller routine and merged into one variant call format (VCF) file, which lists all variable sites for all samples under investigation using gatk CombineGVCFs and GenotypeGVCFs routines. Raw SNP sites and indels are filtered for quality using the gatk VariantFiltration routine, employing best practice for gatk (https://softw are.broad insti tute. org/gatk/best-pract ices). (genotype-quality) in vCftoolS version 0.1.17 (Danecek et al., 2011).
Note that all filtering parameters above can be configured.

| RE SULTS
To validate the Escalibur pipeline, we downloaded the soft-filtered, uncalibrated variant data set (v20210121) from the Caenorhabditis elegans Natural Diversity Resource (CeNDR) (Cook et al., 2017). Raw data for the following strains: CB4856 (NCBI accession nos. SRR3440952, SRR3441150, SRR3441428, SRR3441550, SRR3441658 and SRR3452182) and CB4854 (SRR3452139, SRR3441401 and SRR3441123) were obtained from NCBI. These data sets were run separately through Escalibur using the genome assembly of the Bristol N2 strain of C. elegans (GCA_000002985.3; file GCF_000002985.6_ WBcel235_genomic.fna.gz) as a reference. Reads were trimmed and mapped, then assessed for potential contaminating reads originating from bacteria (GCF_001484935.1); initial base-calls remained uncali- The resultant filtered homozygous variants were very similar to those in the CeNDR database (Table 1). The numbers of SNPs F I G U R E 1 Schematic representation of the genomic analysis pipeline, Escalibur. In step 1, paired-end (PE) sequence reads in each genomic library are trimmed; resultant reads are mapped to all reference genomes; binary sequence alignment/map (BAM) files for each sample are combined; and PCR duplicates are marked. The optimum reference genome is established based on mapping-rate and readcoverage averages. In step 2, the mapping quality is assessed, extraneous "contaminating" sequences (originating, for example, from the environment, microbes and/or host species) are removed and BAM files are corrected accordingly. In step 3, using gatk, the corrected BAM files are optionally recalibrated; the base quality score recalibration (BQSR) is applied in two iterations against high-quality variants; initial variants are called; variant call format (VCF) files are created; and variants are then filtered to retain only those which are statistically significant called in Escalibur for CB4854 (n = 80,207-80,228) and CB4856 (n = 210,552-211,244) were lower than those in the CeNDR database (i.e., 90,032 and 214,914, respectively). Uncalibrated data had more common SNPs compared with calibrated data for both CB4854 (n = 74,411 vs. 74,344) and CB4856 (n = 189,597 vs. 187,784).
Consequently, smaller numbers of unique SNPs were recorded in uncalibrated than in calibrated data from CeNDR-that is, CB4854 Similar results were achieved for indels, for which the counts called using Escalibur for data relating to strains CB4854 (n = 26,076-26,081) and CB4856 (n = 68,216-68,982) were lower than those obtained from CeNDR (i.e., 42,261 and 95,513, respectively). Comparing Escalibur with CeNDR, more common indels were recorded in uncalibrated than in calibrated data for strains CB4854 (n = 9,354 vs. 9,662) using Escalibur. Consistently, more SNPs and indels were shared between resultant Escalibur and CeNDR data using uncalibrated rather than calibrated data sets-a result that was expected due to the lack of a calibration step for CeNDR data. When the effect of using calibrated vs. uncalibrated data was assessed, there was no marked increase or decrease in the numbers of variants called (Table 1), with small percentage differences (usually ~0.5% to 2.5%) in the SNPs and indels called. were measured using a stand-alone Linux server, equipped with 48 processing cores and 512 GB of RAM, when running data for both strains (i.e., CB4854 and CB4856) together (Table S2). Calendar time varied, depending on the allocation of available processing cores to programs in configuration files and on generic load of the server during the runs.

| DISCUSS ION
Exploring genetic variation within species of nonmodel organisms, such as disease-causing pathogens, can have important implications, for instance, for understanding the responses of a population to selection pressures, such as environmental changes and drug treatment, and can be central to elucidating pathogen ecology and epidemiology. The use of advanced informatics is enabling population genetic analyses of DNA sequence data sets produced using the latest DNA sequencing technologies. However, the processing TA B L E 1 Homozygous variants called in data representing strains CB4854 a and CB4856 b of Caenorhabditis elegans using Escalibur compared with variants recorded in the CeNDR database. Variants were called separately using calibrated and uncalibrated data of large genomic data sets for "lesser-studied" organisms has been challenging, and automated pipelines for the rapid analysis of such data sets are scant.
Escalibur uses proven CroMwEll pipeline execution software that allows its integration into a range of computing environments, is reliable and scalable, and supports call-caching. Therefore, the pipeline can resolve critical and computationally demanding, mapping and calibration steps for SNP-calling in an automated manner. Moreover, the use of the WDL workflow language allows end-users to expand and modify the pipeline to suit their needs, as required. Currently, WDL can also be used with execution engines other than CroMwEll, such as Miniwdl (https://github.com/ chanz ucker berg/miniwdl; October 13, 2021) and toil (Vivian et al., 2017). However, this workflow has not been tested on these engines. The use of a container to run software packages spares the end-user from deploying the required tools and their dependencies, which would, otherwise, be a daunting task for researchers without extensive expertise in bioinformatics. Regardless, a familiarity with WDL would be beneficial to addressing issues if they were to arise when running the pipeline.
We limited Escalibur to the gatk variant caller due to gatk's extensive use in the community and its consistent and versatile applicability to data produced on different sequencing platforms (Chen et al., 2019 Here, our goal was to integrate into Escalibur functionalities, such as the selection of the best reference genome and the removal of bias or errors caused by potential contaminating sequences, which are not included in available pipelines. The pipeline does not aim to remove contamination from reference genome sequences, rather only from mapped read data, using user-selected genomes of organisms or microorganisms that might be possible contaminants. The selection of reference genomes for potential contaminants is usually based on prior knowledge of typical contaminants, or an exploration of the sequence data obtained using programs such as CEntrifugE (Kim et al., 2016). Our purpose here was to increase functionality, and not to compare the performance of our workflow with that of other published pipelines, as it is well recognized that differing experimental conditions (e.g., quality and preparation of genomic DNA) and procedures (type of library construction and sequencing method) (Chen et al., 2019;Pei et al., 2021) can influence analyses, depending on the workflow system, and thus render the