hgvs: A Python package for manipulating sequence variants using HGVS nomenclature: 2018 Update

Abstract The Human Genome Variation Society (HGVS) nomenclature guidelines encourage the accurate and standard description of DNA, RNA, and protein sequence variants in public variant databases and the scientific literature. Inconsistent application of the HGVS guidelines can lead to misinterpretation of variants in clinical settings. Reliable software tools are essential to ensure consistent application of the HGVS guidelines when reporting and interpreting variants. We present the hgvs Python package, a comprehensive tool for manipulating sequence variants according to the HGVS nomenclature guidelines. Distinguishing features of the hgvs package include: (1) parsing, formatting, validating, and normalizing variants on genome, transcript, and protein sequences; (2) projecting variants between aligned sequences, including those with gapped alignments; (3) flexible installation using remote or local data (fully local installations eliminate network dependencies); (4) extensive automated tests; and (5) open source development by a community from eight organizations worldwide. This report summarizes recent and significant updates to the hgvs package since its original release in 2014, and presents results of extensive validation using clinically relevant variants from ClinVar and HGMD.


INTRODUCTION
The standardized representation of genomic, transcript, and protein sequence variants is essential in biomedical research and clinical genetics. Accurate interpretation of sequence variants in genetic tests, and therefore the resulting patient diagnosis, depends on variants being described, communicated, and compared using consistent representations. The Human Genome Variation Society (HGVS) nomenclature guidelines, first proposed in 1998 (Antonarakis, 1998; den Dunnen & Antonarakis, 2000), have become the de facto international standard for reporting sequence variants (Li et al., 2017; Richards et al., 2015). The guidelines are widely employed in public databases (Fokkema et al., 2011; Landrum et al., 2017), and tools (Cingolani et al., 2012; McLaren et al., 2016; Wang, Li, & Hakonarson, [...]
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
With the widespread adoption of high-throughput sequencing and the complexity of DNA, RNA, and protein variants, the HGVS nomenclature has continued to evolve. Manually generated HGVS representations often apply the HGVS nomenclature guidelines incompletely or incorrectly, resulting in malformed representations, incorrect reference bases, or incorrect normalization as required by the HGVS nomenclature (Deans, Fairley, den Dunnen, & Clark, 2016; Tack et al., 2016). To meet these demands, specialized tools for manipulating HGVS representations of variants according to the HGVS nomenclature guidelines have been developed, including Mutalyzer (Wildeman, van Ophuizen, den Dunnen, & Taschner, 2008) and the hgvs package (Hart et al., 2014). Mutalyzer is a web-based service for checking the

Package overview
The hgvs package is composed of five major modules (Figure 1):

Key changes since the original release

Parser
The parser in the hgvs package is based on a parsing expression grammar (Hart et al., 2014). In hgvs 1.0, grammar rules were added to support parsing inversion, conversion, and identity variants, in addition to existing support for substitution, deletion-insertion, insertion, deletion, duplication, and repeated sequences.
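To make the edit types listed above concrete, the following is a minimal sketch of parsing a few simple HGVS strings. It is not the hgvs parser: it uses a regular expression instead of a parsing expression grammar, it ignores intronic offsets, UTR positions, conversions, and repeats, and the names `SimpleVariant` and `parse_simple_hgvs` are hypothetical.

```python
import re
from typing import NamedTuple


class SimpleVariant(NamedTuple):
    accession: str   # e.g. "NM_000024.5"
    seq_type: str    # one of c, g, m, n, r
    start: int
    end: int
    edit_type: str   # sub, del, dup, ins, inv, delins, identity
    ref: str
    alt: str


# "delins" is listed before "del" so the longer keyword wins.
_PATTERN = re.compile(
    r"^(?P<ac>[A-Z]{2}_\d+\.\d+):(?P<type>[cgmnr])\."
    r"(?P<start>\d+)(?:_(?P<end>\d+))?"
    r"(?:(?P<sub>[ACGT]>[ACGT])"
    r"|(?P<kw>delins|del|dup|ins|inv)(?P<seq>[ACGT]*)"
    r"|(?P<ident>=))$"
)


def parse_simple_hgvs(text: str) -> SimpleVariant:
    """Parse a simple nucleotide variant; raise ValueError otherwise."""
    m = _PATTERN.match(text)
    if m is None:
        raise ValueError(f"unsupported or malformed variant: {text!r}")
    start = int(m["start"])
    end = int(m["end"]) if m["end"] else start
    if m["sub"]:
        ref, alt = m["sub"].split(">")
        edit_type = "sub"
    elif m["ident"]:
        edit_type, ref, alt = "identity", "", ""
    else:
        edit_type = m["kw"]
        # For ins/delins the trailing bases are the inserted sequence;
        # for del/dup/inv they are the (optional) reference sequence.
        if edit_type in ("ins", "delins"):
            ref, alt = "", m["seq"]
        else:
            ref, alt = m["seq"], ""
    return SimpleVariant(m["ac"], m["type"], start, end, edit_type, ref, alt)
```

A grammar-based parser scales better than this pattern once nested position syntax (offsets, uncertain ranges) is needed, which is one reason a regular expression is only suitable as an illustration.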

Validator
The validator module ensures that a variant is semantically valid and adheres to HGVS guidelines. Validation is performed in two stages: intrinsic validation, which ensures that a variant is internally consistent, and extrinsic validation, which uses external data to verify the variant. Because extrinsic validation requires external data, it is more computationally expensive and is therefore performed after intrinsic validation.
The validation mechanism was significantly refactored in hgvs 1.0.
Validators consist of sets of validation criteria that are invoked for a specified variant. Validation criteria now return one of three validation response levels: VALID when all criteria are satisfied, ERROR when a criterion is violated, or WARNING when a criterion cannot be evaluated (discussed below). Validators always raise an exception when any of the validation criteria return ERROR. In addition, validators support a strict mode in which an exception is raised when a criterion returns WARNING.
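The three response levels and the strict mode can be sketched as follows. This is an illustrative model only, not the hgvs validator: variants are plain dictionaries, and the names `Level`, `ValidationError`, `validate`, and both criteria are hypothetical.

```python
from enum import Enum


class Level(Enum):
    VALID = "VALID"
    WARNING = "WARNING"
    ERROR = "ERROR"


class ValidationError(Exception):
    """Raised on ERROR, or on WARNING when strict mode is enabled."""


def start_not_after_end(variant):
    # Intrinsic criterion: needs no external data.
    if variant["start"] <= variant["end"]:
        return Level.VALID, ""
    return Level.ERROR, "start position is after end position"


def reference_agrees(variant):
    # Extrinsic criterion: expected_ref stands in for a sequence lookup.
    # None models the case where the sequence cannot be retrieved.
    expected = variant.get("expected_ref")
    if expected is None:
        return Level.WARNING, "reference sequence cannot be checked"
    if expected == variant["ref"]:
        return Level.VALID, ""
    return Level.ERROR, f"reference {variant['ref']} does not match {expected}"


def validate(variant, criteria, strict=False):
    """Run criteria in order; return collected warning messages."""
    warnings = []
    for criterion in criteria:
        level, message = criterion(variant)
        if level is Level.ERROR:
            raise ValidationError(message)
        if level is Level.WARNING:
            if strict:
                raise ValidationError(message)
            warnings.append(message)
    return warnings
```

Listing the cheap intrinsic criterion first mirrors the ordering described above: an internally inconsistent variant is rejected before any external lookup is attempted.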
For example, the extrinsic validator includes a criterion that verifies the agreement of the reference sequence provided in a variant with the sequence implied by the accession and variant location. When these sequences match, the criterion is satisfied (returns VALID). When the sequences do not match, the criterion is violated (returns ERROR).
However, an important third case exists: when the variant refers to intronic sequence, which cannot be validated or refuted, the criterion returns WARNING (and an appropriate message). In the default mode, the extrinsic validator records the WARNING but does not raise an exception; in the strict mode, the extrinsic validator raises an exception. In this way, the hgvs package enables users to distinguish variants that are unambiguously valid, plausibly valid, and unambiguously invalid.
[...] with variants in border cases, such as variants located at exon-intron boundaries (Figure 2B).

Formatter
Variant formatting converts an internal object representation into a conventional HGVS textual form. The upgraded hgvs package enables software developers to specify how a variant is formatted (Table 2).
Users can specify the maximum reference length to be displayed for deletions. For large deletions that exceed the maximum display sequence length, the reference sequence is omitted from the display.
Protein variants can be formatted using one-letter or three-letter (default) representations of amino acids. In addition, stop codons in three-letter representations may be represented by Ter (default) or *.
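The formatting options described above can be illustrated with a small sketch. The function and parameter names here (`format_aa`, `three_letter`, `ter_as_star`, `max_ref_length`) and the default length of 4 are assumptions made for this example, not the hgvs configuration keys.

```python
AA3_TO_AA1 = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}


def format_aa(aa3, three_letter=True, ter_as_star=False):
    """Render one amino acid given in three-letter form."""
    if not three_letter:
        return AA3_TO_AA1[aa3]
    if aa3 == "Ter" and ter_as_star:
        return "*"
    return aa3


def format_protein_substitution(ref3, pos, alt3, **options):
    """Render a protein substitution such as p.Arg16Gly."""
    return f"p.{format_aa(ref3, **options)}{pos}{format_aa(alt3, **options)}"


def format_deletion(start, end, ref, max_ref_length=4):
    """Render a deletion, omitting the reference bases when too long."""
    pos = str(start) if start == end else f"{start}_{end}"
    shown = ref if len(ref) <= max_ref_length else ""
    return f"{pos}del{shown}"
```

Keeping the internal representation in three-letter form and converting only at display time means a single object can be rendered in whichever style a downstream report requires.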

Projection (Mapping)
The variant mapper in the hgvs package projects (maps) sequence variants between aligned sequences and predicts the protein-level changes with respect to transcript-level variation. Alignments between transcript and genome sequences often contain sequence discrepancies, including indels, due to sequencing errors in databases and natural sequence variation in populations. A distinguishing feature of the hgvs package is its ability to correctly account for indels between transcript and genome sequences. This ability is critical to accurately interpreting variants in many clinically significant genes (Kalia et al., 2017).
The AssemblyMapper module was added in hgvs 1.0 to significantly streamline projecting variants between genome, transcript, and protein sequences within a single assembly. The module supports any assembly from the NCBI Assembly resource, provided that corresponding genome-transcript alignment data are available.
[...] (Figure 3). Local installation also allows sites to precisely control deployed versions of software and be assured that no patient data are exposed externally. The hgvs privacy statement at https://hgvs.readthedocs.io/en/stable/privacy.html provides details about data that are and are not collected when using public services.

Effect of local UTA and SeqRepo instances
We evaluated and compared the running time of validation, normalization, and mapping for 100,000 transcript variants in ClinVar (Landrum et al., 2017), using local and remote instances of UTA and local (SeqRepo) and remote sequence sources. The evaluations were run on the same Amazon EC2 m4.xlarge computing instance. Using local UTA and local SeqRepo accelerated validation 53-fold, normalization 39-fold, and mapping 34-fold, compared with using remote UTA and remote sequence data sources (Figure 3).

Parsing and validating ClinVar and HGMD variants
To demonstrate the robustness of the upgraded hgvs package, we applied it to batch analysis of transcript variants and genomic variants from ClinVar, which is a trusted large-scale repository for clinically relevant variants (Landrum et al., 2017)

Normalizing ClinVar and HGMD variants
Given the utility of ClinVar and HGMD as resources for clinically relevant variants, standard and uniform representation of variants in ClinVar and HGMD according to the HGVS nomenclature guidelines is critical for the identification and interpretation of disease-related variants. We first utilized the hgvs normalizer to standardize the represen-

Round-Trip projection of ClinVar variants
To test how faithfully the hgvs package projects variants between aligned sequences, we undertook "round-tripping" tests in which an original variant was projected from one sequence to another and back; the expectation is that the original and resulting variants should be identical.
In the first test, we projected genomic variants in ClinVar to tran- [...] transcript variants in ClinVar were the same as the cross-mapping generated transcript variants produced by hgvs.
[...] In this section, we elaborate on the origin of those differences with specific cases. Table 5 summarizes differences between the two packages (hgvs and Mutalyzer) that affect the accuracy of variant manipulation.
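The idea of a round-trip test can be sketched with a toy coordinate mapping. The exon table, the plus-strand and ungapped-alignment assumptions, and the function names `n_to_g`, `g_to_n`, and `round_trips` are all hypothetical; the hgvs package uses real alignment data and handles strand and gaps.

```python
# Hypothetical exons on the plus strand, 1-based inclusive:
# (transcript_start, transcript_end, genome_start)
EXONS = [(1, 100, 1000), (101, 250, 2000), (251, 400, 5000)]


def n_to_g(n_pos, exons=EXONS):
    """Project a transcript position to a genomic position."""
    for t_start, t_end, g_start in exons:
        if t_start <= n_pos <= t_end:
            return g_start + (n_pos - t_start)
    raise ValueError(f"transcript position {n_pos} is out of bounds")


def g_to_n(g_pos, exons=EXONS):
    """Project a genomic position back to a transcript position."""
    for t_start, t_end, g_start in exons:
        g_end = g_start + (t_end - t_start)
        if g_start <= g_pos <= g_end:
            return t_start + (g_pos - g_start)
    raise ValueError(f"genomic position {g_pos} is not in an exon")


def round_trips(n_pos):
    """True when projecting out and back returns the original position."""
    return g_to_n(n_to_g(n_pos)) == n_pos
```

A full round-trip test compares complete variants rather than bare positions, since reference bases and normalization can also change during projection.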

Indel-aware alignment
Due to polymorphisms and sequencing errors, a small number of transcript-genome alignments contain substitutions or indels. As of

Validating variants before projection
The hgvs package validates variants before projection in order to ensure that algorithms are applied in appropriate contexts.
For example, it refuses to project variants with invalid coordinates. When projecting NM_003002.3:c.500000G>T to genomic sequence, hgvs raises an error signaling that the nucleotide coordinate is out of bounds. In contrast, Mutalyzer projects this variant to NC_000011.9:g.112465214G>T, nearly 500 kb from the transcript.
Projecting variants in the vicinity of sequence substitutions and indels is challenging. The variant mapper in the new hgvs package is designed to handle variants located at transcript-genome alignment gaps and substitutions. Extensive tests demonstrate that the new hgvs package correctly projects such variants between transcript and genome sequences. Table 6 summarizes hgvs results of projecting variants that are within, exactly cover, partially cover, and extend beyond the bounds of the transcript-genome alignment gaps.
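The difference between extrapolating blindly and validating first can be sketched as below. This toy assumes a single-exon, plus-strand transcript addressed by transcript (n.) positions; `OutOfBoundsError`, `naive_project`, and `checked_project` are hypothetical names, not hgvs functions.

```python
class OutOfBoundsError(ValueError):
    """Raised when a position lies outside the transcript."""


def naive_project(n_pos, genome_offset):
    # Extrapolates blindly: any integer yields a genomic coordinate,
    # even one far beyond the end of the transcript.
    return genome_offset + n_pos


def checked_project(n_pos, transcript_length, genome_offset):
    """Project only after confirming the position is on the transcript."""
    if not 1 <= n_pos <= transcript_length:
        raise OutOfBoundsError(
            f"position {n_pos} is outside transcript bounds "
            f"1..{transcript_length}"
        )
    return naive_project(n_pos, genome_offset)
```

Failing loudly is the safer default here, because a plausible-looking genomic coordinate gives a downstream user no hint that the input was invalid.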

Updating variant reference sequence after projection
When there are substitution differences between transcript and genomic sequence, the variant reference sequence must be updated to reflect the correct sequence. For example, NM_000024.5:c.46 corresponds to NC_000005.9:g.148206440, the site of a known SNP (rs1042713). The reference nucleotides in the transcript and genomic sequence are A and G, respectively. When hgvs projects NM_000024.5:c.46A>T to NC_000005.9:g.148206440G>T, it replaces the reference A with G. Mutalyzer returns NC_000005.9:g.148206440A>T, which is invalid.
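A minimal sketch of the reference update follows, assuming a plus-strand transcript and a toy genome string. The function name `project_substitution` is hypothetical, and rendering the result as an identity variant when the alternate base equals the genomic reference is a choice made for this example rather than documented hgvs behavior.

```python
def project_substitution(alt, g_pos, genome):
    """Describe a substitution on the genome at 1-based position g_pos.

    The transcript reference base is deliberately not an argument:
    the reference is always re-read from the target sequence.
    """
    g_ref = genome[g_pos - 1]
    if g_ref == alt:
        # The "variant" allele is the genomic reference at this site.
        return f"g.{g_pos}="
    return f"g.{g_pos}{g_ref}>{alt}"


# Toy genome in which position 3 is G, while the transcript carries A.
TOY_GENOME = "ACGGT"
```

With this toy sequence, a transcript variant A>T at the site aligned to position 3 is reported as `g.3G>T`, analogous to the c.46A>T example above.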

Normalizing variants after projection
When projecting variants between sequences, it is necessary in some circumstances to renormalize the variant, especially for transcripts on the minus strand, because 3′ normalization will operate in the opposite direction to the genomic sequence (plus strand [...] has the same net effect as inserting CGTAC after 32417917 (NC_000011.9:g.32417913_32417917dup).

Rewriting variants in preferred forms
In addition to 3′ shifting, the normalizer algorithm in hgvs rewrites variants according to this priority scheme: substitution > deletion > inversion > duplication > conversion > insertion > deletion-insertion.
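One step of this scheme, 3′ shifting an insertion and then preferring a duplication over an insertion, can be sketched as follows. This is a simplified illustration on a plain string, not the hgvs normalizer; the function names are hypothetical and only insertions are handled.

```python
def shift_insertion_3prime(seq, pos, ins):
    """Shift an insertion placed after 1-based position pos rightwards.

    While the next reference base equals the first inserted base, the
    insertion can move one base 3' by rotating the inserted sequence.
    """
    while pos < len(seq) and seq[pos] == ins[0]:
        ins = ins[1:] + ins[0]
        pos += 1
    return pos, ins


def describe_insertion(seq, pos, ins):
    """Return a normalized description: dup if possible, else ins."""
    pos, ins = shift_insertion_3prime(seq, pos, ins)
    n = len(ins)
    # Duplication outranks insertion: use it when the inserted bases
    # repeat the reference bases immediately 5' of the insertion site.
    if pos >= n and seq[pos - n:pos] == ins:
        start = pos - n + 1
        return f"{start}dup" if n == 1 else f"{start}_{pos}dup"
    return f"{pos}_{pos + 1}ins{ins}"
```

For the toy sequence `AACGTACTT`, inserting CGTAC after position 2 shifts to an insertion after position 7 and is then reported as `3_7dup`, the same relationship as the CGTAC example in the preceding section.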

Additional differences
A broader comparison of features of hgvs and Mutalyzer is summarized in Supporting Information Table S1.

DISCUSSION
The hgvs package is a robust tool for working with HGVS variants.
By making hgvs freely available for commercial and noncommercial uses, and by providing support for fully local installations, we have provided a flexible, clinical-grade toolkit that contributes to the accurate interpretation of variants for patients and the consistent description of HGVS variants in public databases.

ACKNOWLEDGMENTS
The hgvs