Variant calling: Considerations, practices, and developments

Abstract The success of many clinical, association, or population genetics studies critically relies on properly performed variant calling step. The variety of modern genomics protocols, techniques, and platforms makes our choices of methods and algorithms difficult and there is no “one size fits all” solution for study design and data analysis. In this review, we discuss considerations that need to be taken into account while designing the study and preparing for the experiments. We outline the variety of variant types that can be detected using sequencing approaches and highlight some specific requirements and basic principles of their detection. Finally, we cover interesting developments that enable variant calling for a broad range of applications in the genomics field. We conclude by discussing technological and algorithmic advances that have the potential to change the ways of calling DNA variants in the nearest future.


| INTRODUCTION
Genome and exome sequencing techniques are rapidly replacing genotyping microarrays in diagnostic and prognostic studies.
A key advantage of new technologies is their ability to give a nearly complete view on the content of our genomes with digital precision. With recent improvements in scalability and constantly decreasing prices, genome sequencing has all potential for being a method of choice for a broad range of genomics studies.
However, as a relatively new invention that also has multiple technological peculiarities, the analysis of genome variants using sequencing data is less standardized compared to genotyping arrays. In this review, we will discuss current approaches to variant calling using sequencing data and specifically focus on considerations while planning genomics experiments, practices while executing them, and promising developments to consider for future projects.

| CONSIDERATIONS
DNA sequencing is popular among researchers and companies, one cannot complain about the limited choice of platforms and kits for preparing and sequencing the samples. Before engaging in a genomics project using sequencing, it is a good idea to consider different options and choose the one that matches your research question and provides the most cost-efficient solution.

| DNA isolation and fragmentation
The basic experimental procedures such as DNA isolation can have systematic effects on the representation of different genomic regions (van Heesch et al., 2013) and, as a consequence, on variant detection. Also, DNA fragmentation, a necessary step for short-read library preparation, can be achieved through a variety of methods, such as mechanical fragmentation, sonication, acoustic and hydrodynamic shearing, chemical or enzymatic fragmentation. The most recent addition to the portfolio of fragmentation methods is Tn5 transposases that can "deliver" polymerase chain reaction (PCR) or sequencing adaptors directly to DNA (Picelli et al., 2014). Choices on DNA sources, isolation and fragmentation methods can vary from project to project, but it is a good idea to agree on these standards before starting a collaborative project involving multiple research groups. The quality and quantity of genomic DNA determine the need for PCR amplification and can greatly affect the quality of variant calls.

| PCR duplicates
Redundant reads originating from the same DNA fragment introduced during the PCR amplification of a library are common artifacts created during genome sequencing. Such artifacts can falsely increase allele frequency or even introduce erroneous mutation calls. There are several ways to either remove or mark the PCR duplicates to lower the false-positive calls. When sufficient input material is available, modern protocols allow for amplification-free library construction. When the input material is scarce and does not meet requirements for PCR-free library preparations, the individual molecules can be barcoded by adaptors with unique molecular identifiers (UMIs). Even after the PCR amplification of such a library, UMIs will help to discard PCR duplicates. When neither PCR-free prep nor UMIs were used, it is a common practice to mark additional reads with shared mapping coordinates as potential duplicates. However, this computational method might overcorrect (duplicated regions) or miss (repetitive regions) the real duplicates.

| Genome coverage
Choice of sequencing strategy has an important effect on the average depth of the genome coverage. Short-read whole-genome sequencing offers the most complete approach and typically yields 30×, while long reads with lower per-base quality are routinely sequenced to 60× coverage. Targeted resequencing, that involves selection of genomic regions, exhibits a rather nonuniform coverage of target regions (e.g., whole-exome sequencing). As sufficient coverage is a key parameter in variant calling, exome sequencing employs an even higher 90×-100× average coverage to compensate for uneven coverage among selected regions. Interestingly, some projects deliberately turn to a lower sequencing depth, when sequencing related individuals or looking for common association signals in a large population (CONVERGE Consortium, 2015; Genome of the Netherlands Consortium, 2014).

| Platform and read length
Short reads sequencing (typical reads span several hundred bases) is currently the most popular and cost-effective strategy for variant detection. As price is typically proportional to the number of bases read, the preference should be given to the longest possible reads (available for this platform) and paired-end over single-end mode ( Figure 1). This will allow to uniquely address a bigger proportion of the genome, improve detection of structural variants and improve the performance of read assembly methods. Overall, short read-based methods are well-established, cheap, have low error rates, but can fail to distinguish recently duplicated, repetitive sequences and can under-represent DNA segments with very high or low GC content.
On the other hand, long reads can cover tens and hundreds of F I G U R E 1 Overview of experimental factors that are important for planning and performing a genome sequencing study ZVERINOVA AND GURYEV | 977 kilobases, which is advantageous for calling of larger structural variants as they can span over large repeats and regions with GC content bias. However, the costs per base, as well as error rates, are higher than those for short reads. Several interesting alternatives to obtain high-quality data with perfect mapping ability of long reads rely on advanced barcoding of short reads (linked-reads), circular consensus reading (PacBio high fidelity reads), and combining long-and short-read sequencing data (Wenger et al., 2019;Zhang et al., 2020).
To conclude, we tried to outline some of the important choices that need to be made at the beginning and might affect the efficiency of calling and completeness of the resulting variant set. When your experience with variant calling is rather limited it is recommended to involve a data analysis specialist at the stage of project conception rather than after data is produced. This can help to avoid issues of lacking power or poor ability to call certain types of variants from one side and wasting the resources from another. and unlocalized contigs will prevent the mismapping of reads originating from these sequences and prevent many false-positive calls. It is a general recommendation to use the so-called primary assembly version (chromosomes, mtDNA, unplaced and unlocalized contigs) of the reference genome unless a specific aim of the study calls for the use of an extended version.

| Aligning
To avoid the multiple sources of errors in data, such as amplification biases, errors introduced during sequencing and base calling, and mapping artifacts that arise during reads alignment, the data must be appropriately pre-processed. At this stage, it is preferable to prioritize sensitivity over specificity, to ensure the abundance of potential variants, rather than missing any. A good example of balanced preprocessing steps is described in Genome Analysis ToolKit (GATK) best practices tutorial (DePristo et al., 2011;GATK_Team, 2021), and can be used as a guide to constructing a workflow according to the established and validated analysis principles. The pipeline uses a BWA-MEM aligner, a robust mapping algorithm, to map the sequence data to a reference genome (H. Li, 2013). An optional next step involves marking PCR duplicates for PCR-amplified libraries, for example using SAMtools package (H. Li et al., 2009) or according to manufacturer recommendation if UMIs were employed. GATK is a set of tools that can be easily used to perform the rest of preprocessing and variant calling routines. A common practice in data preprocessing is base quality score recalibration (BQSR). Base quality is a confidence score provided by the sequencer for each DNA base in the data and often needs recalibration to correct for any systematic bias, which can arise during library preparation and genome sequencing. The recalibration process uses covariate measurements from the base call in the data set to build and apply an adjustment model, resulting in highly accurate base quality scores.

| Calling small variants
The most common type of changes observed in our genomes is a single nucleotide variant (SNV), a substitution of a single base at a certain position in the genome. A typical genome sequencing project detects several million SNVs per individual human sample. When reads are properly aligned to the genome reference, calling a standalone single-base "typo" is rather straightforward. Apart from the GATK program suite mentioned above, several others such as BCFtools and FreeBayes are popular tools that can generate a list of small variants in a computation-efficient way (Table 1) (Danecek et al., 2021;Garrison & Marth, 2012). Interestingly, summary characteristics of SNVs can serve as quality control parameters in population sequencing studies. Thus, a biased transition to transversion ratio can indicate errors in SNV analysis or a presence of strong mutational bias in the genome.
Other small variants that lead to small insertion or deletion between 1 and 20 DNA bases are commonly known as indels. These changes are shorter than a sequencing read and can be found by realigning partially mapped or unmapped reads using split-read aligners (Ye et al., 2018).
The biggest challenges in calling small variants present themselves in hypervariable and repetitive regions of the genome.
Both tightly clustered variants and look-a-like regions in the genome present problems for small variant calling. The former can be alleviated by haplotype-aware calling where, instead of "walking" genome one position at a time, polymorphic regions are located and their sequences are assembled from local sequencing reads, resulting in longer stretches of allelic sequences that can be mapped more accurately. The latter problem can be resolved by using longer sequencing reads that overlap a sufficient number of bases that are different between otherwise very similar regions of the genome.

| SV calling
Larger variants that encompass 20 and more DNA bases, also known as structural variants (SVs) (ii) Altered distance between paired reads. When the distance between mapping positions of paired reads significantly deviates from the expected size of the DNA fragment, it may indicate local DNA content difference between reference and genome of the patient. A longer distance between mapped paired reads indicates more DNA bases in reference than in the patient sample and hence deletion, while a shorter distance can be a hallmark of an insertion.
(iii) Discordant orientation of paired reads. When aligned, paired reads typically map to opposite DNA strands. In rare cases when one read from a pair overlaps the breakpoint of an SV, the relative mapping of reads in a pair can exhibit unexpected directionality.
Thus, if one of the reads ends in an inverted segment, both reads would map to the same DNA strand. Similarly, tandem duplications frequently show inverse read order (upstream read on reverse, downstream read on forward strand) since, in the patient DNA, these reads belong to two sequential copies of the same reference DNA segment.
(iv) Unmapped reads. When an alignment of a read to the reference genome fails, it might indicate that the read overlaps a breakpoint of a structural variant. In this case, split-mapping of this read may reveal the large variants. With many possibilities of arbitrary splitting of an unmapped short read, it is difficult to avoid mismapping. However, the genomic coordinate of the mapped "mate" read from the same pair can be used as a proxy to find the start of the SV breakpoint.
The state-of-the-art SV discovery combines these aforementioned SV signatures in a clever way. Thus, a relative position of read pairs allows discrimination between tandem and dispersed CNVs, which is impossible from the read density analysis alone. The robust and specific SV calls, therefore, rely on multiple independent read pairs and are supported by multiple SV signatures (Figure 2b).  In some of the applications, the assumptions underlying SV signatures might be violated. For example, read density in exome data is very nonuniform and CNV detection methods need to discriminate between effects of change in copy number and technical variability in fragmentation and target selection efficiency. This is especially challenging for small sample sets and single-exon CNVs (Gordeeva et al., 2021).
It is worth noting that read position/orientation and especially read depth signatures cannot resolve an SV breakpoint at base-pair resolution. In this situation, split-read mapping and local de novo assembly help with fine-mapping of the SV coordinates. Expectedly, the longer the repeat, the more unstable it will behave during DNA replication. Longer repeats of this kind have been previously implicated in multiple genetic disorders such as fragile X syndrome or amyotrophic lateral sclerosis (Swinnen et al., 2020).

| Other and exotic variant types
Allelic variants with such repeats, especially high-risk alleles, frequently exceed read and even fragment size typical in short-read sequencing, making their identification challenging. Several software packages were developed to estimate the length of such repeats from genome sequencing data (Bahlo et al., 2018).
Next to tandem repeats, there are also dispersed repeats appearing in our genomes. A common type of dispersed repeats is mobile elements (MEs)-"parasitic" DNA sequences with the ability to copy and/or propagate themselves across DNA. Autonomous elements, such as long interspersed nuclear element 1 (LINE-1) encode for enzymes that can convert its RNA to DNA and insert an extra copy of itself into the genome. Nonautonomous elements are typically short (e.g., SINEs) and use enzymes produced by their autonomous "cousins" for their propagation. Nearly half of our genome may in fact be the result of the past activity of mobile elements.
The arms race between these "egoistic" DNA elements and cellular processes suppressing their replication limits the number of functional full-size copies around a hundred LINEs per human genome (Penzkofer et al., 2017). The sheer presence of ME-derived sequences in genomes makes identification of their new insertions challenging, especially so when a new copy "lands" inside another repeat. Several software packages utilize read pair, split-read, and local assembly to locate and genotype mobile element insertions (MEI) in WGS data (Chu et al., 2020;Gardner et al., 2017;Thung et al., 2014;Torene et al., 2020).
The presence of reverse transcription (RNA-to-DNA) functionality opens a possibility for other RNA molecules to "enter" our genomes. In rare cases, a regular processed human mRNA can be Another genetic mechanism that can result in large-scale changes in our DNA is non-allelic gene conversion (NAGC)-a consequence of improper DNA repair event when a highly similar, but not identical copy of segment is used as a template during homologous recombination repair (Harpak et al., 2017). This mechanism tends to eliminate differences between recently duplicated regions, but can also result in loss of function events when a functional copy of a gene is rewritten using its pseudogene as a template. Such events seem to be rare, difficult to detect from short sequencing reads, and, at a first sight, can look like a set of multiple independent polymorphisms.
Many of the challenges in SV discovery mentioned above are caused by limitations of short-read sequencing and will be solved through democratizing long-read sequencing, with most of SVs being detected from long reads alignments (Miga et al., 2020). and methods (split-read mappings, gapped alignment, de novo assembly) for the identification of complex and structural variants (Rausch et al., 2012). However, as there is still no single tool that has mastered all methods and variant types, consolidation of variant call sets from multiple tools has become a popular approach (Zarate et al., 2020).
Next to this synthesis, the new methods are constantly added to our toolbox (Figure 3). With the increasing popularity of new deeplearning approaches, such as convolutional neural networks, computers can generate pileup image representations and call variants from them. An implementation of such a deep-learning algorithm can call germline SNVs and indels and essentially requires just a single parameter defining the minimum variant quality threshold (Poplin, Chang, et al., 2018).

| Variant calling for other data types
Due to the popularity of sequencing methods, they are also frequently employed for profiling transcriptomes, epigenomes, regulatory proteins attached to DNA or RNA, and so on. These data types can have several important differences that need to be taken into account when calling variants.
Similar to exome sequencing, other data types do not necessarily have a uniform coverage throughout the whole genome, so that one cannot rely on the completeness and or same detection power for the variants called from such data.
Alongside, data set-specific peculiarities need special attention by variant calling software. A good example is gapped alignment common for splicing events in RNA-seq data. It is important to prevent misalignment cases due to assigning small trailing parts of the read to an intron. Thus, transcriptome aligners have functionality for two-stage mapping: identifying all novel splice junctions first (preferably for all samples in the study) and then using them in the second mapping stage to avoid misalignments and false SNPs (Dobin et al., 2013). Further, gaps in the alignment may complicate the variant discovery process, hence some software solutions such as GATK split exonic blocks into multiple supplementary alignments to simplify variant calling.
Another peculiarity of non-genomic data is that some of the

| Phasing variants
In some cases, it is not sufficient to know genotypes to evaluate their involvement in disease. If multiple potentially deleterious alleles exist in a gene, it is important to understand if a single or both copies of a gene are affected, which requires phasing of alleles into haplotypes

| Completeness of genome reference
The non-reference sequences (NRS) represent genomic segments detected in sequencing data, which are currently not represented in the human genome reference. These sequences are often polymorphic in the population and thus can be considered as a type of structural variation (Hehir-Kwa et al., 2016;Naslavsky et al., 2020).
The NRS may contain new complete or parts of genes as well as noncoding DNA sequences potentially influencing transcription.
Integration of data from various genome sequencing projects will help to identify, catalog, map and characterize these segments from the whole human population and build a pangenome reference for the unbiased interpretation of genomics data.

| Long reads technologies
Third-generation sequencing technologies (e.g., developed by Pacific Biosciences and Oxford Nanopore) offer numerous advantages over short-read sequencing. Their main strength lies in the ability to routinely produce reads that are tens and hundreds of thousands of kilobases long. Such long reads simplify de novo assembly, improve the unambiguity of read mapping, and the detection of structural variants. Long-read platforms are also able to provide insight into the nature and frequency of nucleotide modifications, such as methylation. Nevertheless, long-read sequencing is still more costly than short-reads technology and can be expensive for large sequencing projects. Further, their relatively high base error rate usually requires consensus polishing of high coverage data or combining of long-read and short-read data to achieve maximum base call accuracy. The recent and remarkable advances in long-read technologies, such as high fidelity (HiFi) reads, provide proof of principle that these technologies are instrumental to detect the most challenging variants, such as mid-sized SVs on repetitive sequence background, and decode end-to-end representation of human chromosomes (Miga et al., 2020). The resulting completion of human chromosome X demonstrated the possibility to sequence the genome with only a few remaining gaps and the need for further development of methods for the analysis of segmental duplications and satellite arrays.

| CONCLUSIONS
In this paper we reviewed the common practices of variant calling, focusing on preparatory steps, decisions to be made, diversity of variant types, and basic principles of their identification. As the pace of technological developments in genome sequencing does not seem to slow down, we can expect that many obstacles we encounter today will disappear. As a result, researchers would be able to obtain a comprehensive set of variants more easily and dedicate more of their time to clinical and biological questions.

ACKNOWLEDGMENTS
We thank members of the Genome Structure and Ageing team of the European Research Institute for the Biology of Ageing for fruitful discussions and their comments during the preparation of the manuscript. The authors have no conflict of interest to declare.