VariantValidator: Accurate validation, mapping, and formatting of sequence variation descriptions

Abstract
The Human Genome Variation Society (HGVS) variant nomenclature is widely used to describe sequence variants in scientific publications, clinical reports, and databases. However, the HGVS recommendations are complex, and this often results in inaccurate variant descriptions being reported. The open‐source hgvs Python package (https://github.com/biocommons/hgvs) provides a programmatic interface for parsing, manipulating, formatting, and validating variants according to the HGVS recommendations, but does not provide a user‐friendly Web interface. We have developed a Web‐based variant validation tool, VariantValidator (https://variantvalidator.org/), which utilizes the hgvs Python package and provides additional functionality to assist users who wish to accurately describe and report sequence‐level variations that are compliant with the HGVS recommendations. VariantValidator was designed to guide users through the intricacies of the HGVS nomenclature: if the user makes a mistake, VariantValidator automatically corrects it if it can, or provides helpful guidance if it cannot. In addition, VariantValidator can interconvert genomic variant descriptions in HGVS and Variant Call Format with a degree of accuracy that surpasses most competing solutions.


INTRODUCTION
The Human Genome Variation Society (HGVS) nomenclature for the description of human sequence variants is widely adopted by scientific journals and variant databases and is endorsed by professional organizations (Deans, Fairley, den Dunnen, & Clark, 2016; Richards et al., 2015; Tack, Deans, Wolstenhome, Patton, & Dequeker, 2016). As high-throughput sequencing has become more common, HGVS recommendations have evolved to communicate a plethora of new variants to the scientific and healthcare communities (Taschner & den Dunnen, 2011). This has made some aspects of the nomenclature difficult to comprehend and use, for experts and non-experts alike, and has led to many instances of inaccurate communication of variant data. Consequently, high-quality user-friendly tools are required to help investigators validate variant descriptions to ensure that the described variant is valid and consistent with the predicted phenotypic effect. There is also a need for high-quality tools that can convert high-throughput sequence variation descriptions (e.g., the Variant Call Format [VCF], https://github.com/samtools/hts-specs; Danecek et al., 2011) into accurate descriptions of each variant using HGVS nomenclature with respect to all relevant reference sequences (i.e., genomic reference sequences and transcript reference sequences), and vice versa.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2017 The Authors. Human Mutation published by Wiley Periodicals, Inc.
We have built a simple and intuitive Web interface, VariantValidator (https://variantvalidator.org/), which harnesses and automates the key components of the hgvs Python package (Hart et al., 2015). VariantValidator has been designed to provide users with informative guidance on any variant-description errors that it detects, rather than terse error messages.
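The VCF-to-HGVS conversion discussed in this Introduction can be illustrated with a minimal sketch. It is not the method used by VariantValidator or the hgvs Python package: the function name `vcf_to_hgvs_g` is hypothetical, the example coordinates are illustrative only, and the code deliberately omits steps that a real validator must perform (checking REF against the reference sequence, shifting variants to their most 3' position, and recognizing duplications).

```python
def vcf_to_hgvs_g(accession: str, pos: int, ref: str, alt: str) -> str:
    """Turn one VCF-style allele (1-based POS, REF, ALT) into an HGVS-like g. string.

    Hypothetical helper for illustration. Simplifications: no check of REF
    against the reference sequence, no 3' shifting, no duplication detection.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("REF and ALT are identical; there is no variant to describe")

    # VCF pads indels with a shared anchor base; trim shared leading bases.
    while ref and alt and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    # Trim shared trailing bases as well.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]

    prefix = f"{accession}:g."
    if len(ref) == 1 and len(alt) == 1:
        return f"{prefix}{pos}{ref}>{alt}"          # substitution
    if not ref:
        return f"{prefix}{pos - 1}_{pos}ins{alt}"   # insertion between two flanking bases

    end = pos + len(ref) - 1
    span = str(pos) if end == pos else f"{pos}_{end}"
    if not alt:
        return f"{prefix}{span}del"                 # deletion
    return f"{prefix}{span}delins{alt}"             # deletion-insertion


if __name__ == "__main__":
    print(vcf_to_hgvs_g("NC_000017.10", 48275363, "C", "A"))
    print(vcf_to_hgvs_g("NC_000001.10", 100, "ATG", "A"))
```

Even this toy version shows why dedicated tools are needed: the same event can be written in several ways in VCF, and only reference-aware normalization yields a single HGVS-compliant description.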

What is the purpose of VariantValidator?
The hgvs Python package is a powerful tool for: (a) validating HGVS [...] is also a tool that converts variant data from VCF files and feeds them directly into the batch validation tool. We currently implement a fair usage policy limiting the batch tools to processing 20,000 variants in a single job. However, we are in the process of streamlining the batch tool and intend to relax this restriction as soon as possible.
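A user with more variants than the 20,000-per-job limit described above would need to split the input into several jobs. The following is a minimal client-side sketch of that splitting. The helper name `split_into_jobs` and the constant `MAX_VARIANTS_PER_JOB` are hypothetical and are not part of VariantValidator; only the limit itself is taken from the text.

```python
from itertools import islice
from typing import Iterable, Iterator, List

MAX_VARIANTS_PER_JOB = 20_000  # fair usage limit stated in the text


def split_into_jobs(
    variants: Iterable[str], limit: int = MAX_VARIANTS_PER_JOB
) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `limit` variant descriptions.

    Hypothetical helper for illustration; it only partitions the input and
    does not submit anything to a validation service.
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    iterator = iter(variants)
    while True:
        job = list(islice(iterator, limit))
        if not job:
            return
        yield job


if __name__ == "__main__":
    descriptions = (f"variant_{i}" for i in range(45_000))  # placeholder strings
    for number, job in enumerate(split_into_jobs(descriptions), start=1):
        print(f"job {number}: {len(job)} variants")
```

Because the helper consumes an iterator lazily, it works equally well on a list held in memory or on variant descriptions streamed from a large file.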

Ease of use
We aim to consistently use simple workflows, for example, a three-click workflow that allows a genomic variant correctly mapping [...] and (e) VariantValidator generates a series of custom error messages such that users are informed when VariantValidator automatically corrects errors made by the user, or are given informative guidance so that they can correct their own mistakes when VariantValidator is unable to do so. These features allow VariantValidator to access and supplement the wide range of tools provided by the hgvs Python package. VariantValidator can, therefore, provide users with a clean, concise, and user-friendly Web interface that enables responsive validation of sequence variants.

Mutalyzer features not supported by VariantValidator
Although VariantValidator offers an alternative to Mutalyzer, it does not yet provide the full range of functionality that Mutalyzer currently offers, for example, an HGVS name generator (https://mutalyzer.nl/name-generator); a description extractor (https://mutalyzer.nl/description-extractor); and a function to convert amino acid substitutions into likely nucleotide substitutions (https://mutalyzer.nl/backtranslator).
Although the hgvs Python package can parse all common variant types into the formats that its functions require, a key strength of the package is its ability to map sequence-level variation between different reference sequences.
In the current build of the hgvs Python package (1.0.0a1), two particular variant types are currently not well supported with respect to mapping. Gene conversions can be validated with respect to sequence-level variation and HGVS compliance. However, they cannot yet be mapped between reference sequences or mapped into theoretical protein sequence variation descriptions. In this respect, VariantValidator is only slightly less capable than Mutalyzer, which can validate the syntax of a conversion description (e.g., NM_000088.3:c.4_64conNM_004006.1:c.123_171), but not project the variant to other reference sequence contexts. However, we intend to address this deficiency in a future release of VariantValidator.
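To make concrete what validating the syntax of a conversion description involves, the sketch below splits the example description into its acceptor and donor parts with a regular expression. It is an illustration only and reflects neither Mutalyzer's nor VariantValidator's implementation. The function name `parse_conversion` is hypothetical, and the pattern assumes plain integer positions (no intronic offsets or UTR positions) and two-letter RefSeq-style accession prefixes.

```python
import re
from typing import Dict, Optional, Union

# Assumed, simplified grammar: <accession>:<type>.<start>_<end>con<accession>:<type>.<start>_<end>
_REFERENCE = r"[A-Z]{2}_\d+\.\d+"
_CONVERSION = re.compile(
    rf"(?P<accession>{_REFERENCE}):(?P<type>[cgmn])\.(?P<start>\d+)_(?P<end>\d+)"
    rf"con"
    rf"(?P<donor_accession>{_REFERENCE}):(?P<donor_type>[cgmn])\."
    rf"(?P<donor_start>\d+)_(?P<donor_end>\d+)"
)


def parse_conversion(description: str) -> Optional[Dict[str, Union[str, int]]]:
    """Return the parts of a conversion description, or None if it is malformed.

    Hypothetical helper: this checks syntax only. It does not confirm that the
    reference sequences exist or that the positions lie within them.
    """
    match = _CONVERSION.fullmatch(description.strip())
    if match is None:
        return None
    fields: Dict[str, Union[str, int]] = dict(match.groupdict())
    for key in ("start", "end", "donor_start", "donor_end"):
        fields[key] = int(fields[key])
    # A range must run from a lower to a higher position.
    if fields["start"] > fields["end"] or fields["donor_start"] > fields["donor_end"]:
        return None
    return fields


if __name__ == "__main__":
    print(parse_conversion("NM_000088.3:c.4_64conNM_004006.1:c.123_171"))
```

A syntax check of this kind says nothing about where the variant falls on other reference sequences, which is the mapping step that the text identifies as unsupported.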

Plans for further development
The hgvs Python package and UTA are undergoing continuing development, and we may consider expanding VariantValidator to provide support for additional specific types of sequence variation and reference sequence types in the future. Proper future support for inversions might allow us to use native hgvs Python package functions rather than our own custom code. Similarly, support for gene conversions would be a desirable feature. However, the desire to properly support inversions and conversions must be set against the fact that instances of such variant types are relatively rare. We are currently re-developing our batch analysis tools (batch validator and vcf2hgvs) to enhance their performance so that results are returned to our users more quickly.