ClinGen Allele Registry links information about genetic variants

Abstract Effective exchange of information about genetic variants is currently hampered by the lack of readily available globally unique variant identifiers that would enable aggregation of information from different sources. The ClinGen Allele Registry addresses this problem by providing (1) globally unique “canonical” variant identifiers (CAids) on demand, either individually or in large batches; (2) access to variant‐identifying information in a searchable Registry; (3) links to allele‐related records in many commonly used databases; and (4) services for adding links to information about registered variants in external sources. A core element of the Registry is a canonicalization service, implemented using an in‐memory sequence alignment‐based index, which groups variant identifiers denoting the same nucleotide variant and assigns unique and dereferenceable CAids. More than 650 million distinct variants are currently registered, including those from gnomAD, ExAC, dbSNP, and ClinVar, as well as a small number of variants registered by Registry users. The Registry is accessible both via a web interface and programmatically via well‐documented Hypertext Transfer Protocol (HTTP) Representational State Transfer Application Programming Interfaces (REST‐APIs). For programmatic interoperability, the Registry content is accessible in the JavaScript Object Notation for Linked Data (JSON‐LD) format. We present several use cases and demonstrate how the linked information may provide raw material for reasoning about a variant's pathogenicity.


INTRODUCTION
Genome research and genomic medicine both depend on the community's ability to effectively exchange and aggregate information about genetic variants. Our immediate motivation came from the need for unique variant identifiers, particularly for variants not previously registered in ClinVar and for the variants undergoing pathogenicity assessment by the ClinGen-supported expert panels. Assessment of a variant's pathogenicity frequently requires information derived from literature, population sequencing databases, high-throughput experiments, curated databases such as ClinVar (Landrum et al., 2016), and a growing number of other sources. More generally, the increasing pace of data accumulation and the growing diversity of resources and variant nomenclatures are challenging consumers' ability to effectively Find and Access (the "F" and "A" in the "FAIR" standard, respectively; Wilkinson et al., 2016) information about an allele with any certainty that the same allele is being referenced. One key aspect of the problem is the lack of globally unique variant identifiers that would enable the aggregation and connection of information from different sources about the same variant. Variants other than single nucleotide polymorphisms (SNPs), such as indels, may be represented in many different ways, each corresponding to a different Human Genome Variation Society (HGVS) expression. Although the problem may be solved in principle by defining "canonical" HGVS expressions and standardizing on a set of reference sequences, practical implementation of this concept is challenging as it requires reconciling "canonical" expressions across a multiplicity of transcript sequences frequently used in clinical genetics.
To overcome the limitations of currently available systems and to address the data aggregation problem at scale, we developed the Clinical Genome Resource (ClinGen) Allele Registry. The Registry provides globally unique "canonical" variant identifiers (the "CAids") on demand via web (UI or API) services. A user-friendly web interface provides various ways to query existing variants and register new ones. The canonical identifiers may be obtained either individually or in high volume via web APIs to meet the registration needs of external databases of any size. The canonicalization service is provided using a custom in-memory index that is based on the alignment of hundreds of thousands of transcript and genomic sequences. For each canonical identifier, there is typically a multiplicity of HGVS notations for the same variant in the context of common genomic and transcript reference sequences.
The Registry also provides web UIs to support identification and registration of individual variants from partial descriptions present in the literature, genetic test reports, or in other sources.
Another important Registry feature is the support for linking to information about registered variants in external sources frequently used by clinicians, diagnostic laboratories, and researchers. The Registry currently links to major resources such as gnomAD, ExAC (Lek et al., 2016), dbSNP (Sherry et al., 2001), MyVariant.info (Xin et al., 2016), COSMIC (Forbes et al., 2015, 2017), and ClinVar (Landrum et al., 2016). Moreover, the Registry also supports on-demand registration of links to additional layers of variant information available from any number of external sources, small or large. The on-demand registration of millions of new variants per request via the APIs is designed to meet the needs of all global genome sequencing efforts that generate germline and somatic variants, of high-throughput in vitro mutagenesis experiments that generate information about functional effects of variants that previously may not have been seen in humans, and even of computational prediction projects that provide information about variants that are yet to be observed in humans or in vitro.
Here, we describe the development and implementation of the ClinGen Allele Registry, its content, and key services. We also summarize sources that use Registry identifiers in their systems. Additionally, we demonstrate both manual access via user-friendly web interfaces and programmatic access via the REST-APIs. Finally, we show how the information linked by the Registry may be "mined" for information about variants' pathogenicity.
[...] allele size is 10,000 bp, a cutoff selected based on a tradeoff between the need to accommodate large alleles and the need to efficiently store and process them. The Registry also allows for "complex" alleles including haplotypes (e.g., CA033016, a haplotype allele that is also present in ClinVar: NM_000402.4:c.[292G>A;466A>G]); however, it treats each haplotype as a single variant and does not explicitly model the individual variants that constitute a haplotype. The nucleotide (genomic and transcript) and protein variants are treated as different types of entities that may be joined by a one-to-many relationship (every transcript variant causes at most one amino acid sequence change, whereas each amino acid sequence variant may correspond to one or more transcript variants).

Overall design and data flow
The Registry's backend is implemented in C++ as a multithreaded HTTP server providing a publicly available REST-API. The services are highly optimized for query and registration of tens of thousands of [...] The overall data flow of the Registry is summarized in Figure 2.
Input variant descriptions are parsed, validated, and represented internally as a series of contextual alleles residing on specific reference sequences (as detailed below in Section 2.2). The normalization step produces a unique variant representation in the context of a specific reference sequence (Section 2.3). The canonicalization step calculates the variant's representation, independent of sequence context, using a sequence-alignment-based index (Section 2.4). The Registry supports both the retrieval of previously registered canonical variants and the registration of new variants. To support the data flow and provide these services, the Registry utilizes two internal databases, the reference database consisting of reference nucleotide (genomic and transcript) sequence alignments (Section 2.5) and an allele database (Section 2.6). A key feature of the Registry is its set of tools and services that support the sharing of links to information about registered variants in sources external to the Registry (Section 2.7).

Parsing and validation
The Registry parser takes as input either VCF representations or HGVS expressions representing the following four types of nucleotide variants: single nucleotide variants (SNVs), insertions, deletions, and indels (illustrated in Table 1).

Normalization
The normalization step generates a unique variant definition, corresponding to an HGVS identifier, for each contextual allele. This involves generating on-the-fly all possible HGVS expressions for the variant. A variant representation may sometimes be converted into another equivalent one by "trimming" and "shifting" it right or left by one or more base pairs without changing the resulting alternate sequence (Munz et al., 2015; Tan et al., 2015; Yen et al., 2017). A simple variant definition (after maximal "trimming") that cannot be further shifted left is referred to as "left-aligned"; similarly, a simple variant definition that cannot be shifted right is referred to as "right-aligned." If the "left-aligned" and "right-aligned" simple representations are the same, the variant cannot be shifted. Otherwise, by shifting the variant left and right, multiple equivalent simple variant expressions are generated, as illustrated in Table 2. In either case, the normalization step always identifies the left-aligned simple variant definition, and the corresponding HGVS expression, for the purpose of grouping equivalent contextual alleles during canonicalization.
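The trimming and left-shifting described above can be illustrated with a minimal Python sketch. It is not the Registry's C++ implementation; the function names, the 0-based coordinates, and the toy reference sequences are assumptions made for illustration only.

```python
def trim(pos, ref, alt):
    """Drop bases shared by ref and alt: common suffix first, then common prefix."""
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1  # removing a leading base moves the start one base to the right
    return pos, ref, alt


def left_align(seq, pos, ref, alt):
    """Return the trimmed, left-aligned (pos, ref, alt).

    seq is the reference sequence and pos is a 0-based offset into it
    (both are conventions chosen for this sketch).
    """
    pos, ref, alt = trim(pos, ref, alt)
    # Only pure insertions (empty ref) or pure deletions (empty alt) can shift.
    if bool(ref) == bool(alt):
        return pos, ref, alt
    allele = ref or alt
    # A shift by one base to the left keeps the alternate sequence unchanged
    # exactly when the preceding reference base equals the allele's last base.
    while pos > 0 and seq[pos - 1] == allele[-1]:
        allele = seq[pos - 1] + allele[:-1]
        pos -= 1
    return (pos, allele, "") if ref else (pos, "", allele)


# Three descriptions of the same "CA" deletion in a CA repeat collapse to one.
reference = "TTCACACAGG"
print(left_align(reference, 6, "CA", ""))    # (2, 'CA', '')
print(left_align(reference, 4, "CA", ""))    # (2, 'CA', '')
print(left_align(reference, 5, "ACA", "A"))  # (2, 'CA', '')
```

Because every equivalent description maps to the same left-aligned tuple, that tuple can serve as a grouping key for contextual alleles on one reference sequence.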

Canonicalization
Although the normalization step defines a unique representation (simple left-aligned) of contextual alleles in the context of a specific reference sequence, the canonicalization step provides a representation for nucleotide variants across multiple references (Figure 2). Because the canonical representation is by definition "context-independent" (independent of the context of any reference sequence), it is denoted by a purely conventional symbolic sign, a canonical identifier ("CAid") and [...]
FIGURE 2 (a) Design and workflow of the ClinGen Allele Registry. (b) Screenshot of current core Registry-hosted links for a typical variant in the user interface
The variant model supports the rare but unavoidable merging and splitting of canonical alleles using the concepts of "Active" and "Inactive" canonical alleles, thus achieving absolute persistence of canonical allele URIs. One event triggering merging or splitting may be the availability of a new human genome assembly in which two variants that have distinct identifiers and reside in two different regions of the old assembly are merged. In this case, one of the two identifiers will become inactive. It is important to note that the "inactive" CAid and the corresponding URI continue to be dereferenceable.

Other events that may trigger merging or splitting include some other changes in reference genome assemblies and changes in alignments.
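One way to picture how an identifier can become inactive after a merge while remaining dereferenceable is the following Python sketch. The paper does not describe how the Registry stores merge history, so the class, its method names, and the CAids used here are hypothetical.

```python
class CanonicalAlleles:
    """Toy identifier store in which merged (inactive) CAids still resolve."""

    def __init__(self):
        self._known = set()      # every CAid ever issued
        self._merged_into = {}   # inactive CAid -> CAid that replaced it

    def register(self, caid):
        self._known.add(caid)

    def merge(self, keep, retire):
        """Mark `retire` as inactive and point it at `keep`."""
        if keep not in self._known or retire not in self._known:
            raise KeyError("both identifiers must already be registered")
        self._merged_into[retire] = keep

    def is_active(self, caid):
        return caid in self._known and caid not in self._merged_into

    def resolve(self, caid):
        """Follow merge links until an active CAid is reached."""
        if caid not in self._known:
            raise KeyError(caid)
        while caid in self._merged_into:
            caid = self._merged_into[caid]
        return caid


registry = CanonicalAlleles()
registry.register("CA0000001")  # hypothetical identifiers
registry.register("CA0000002")
registry.merge(keep="CA0000001", retire="CA0000002")
print(registry.is_active("CA0000002"))  # False
print(registry.resolve("CA0000002"))    # CA0000001
```

Nothing is ever deleted in this sketch: a retired identifier keeps an entry, which is what lets old links continue to resolve.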

Reference database
The reference database consists of alignments of reference nucleotide sequences from key genomic databases (RefSeq, ENSEMBL, and LRG) and supports the validation, normalization, and canonicalization steps.
The nucleotide sequences are aligned against the latest human genome assembly, currently GRCh38. The alignments of ref- [...] The whole reference database currently occupies ∼4 GB of memory.

Allele database
The allele database stores variant definitions and identifiers from major resources (e.g., ClinVar, dbSNP, ExAC, gnomAD). It is composed of a custom low-level key-value database engine with several indices that support fast querying. The database engine fulfills ACID (Atomicity, Consistency, Isolation, Durability) requirements, allowing the Registry to function as an OLTP (On-Line Transaction Processing) system that supports real-time registration of new variants.

Link database
The allele database (Section 2.6) includes variant identifiers from major databases, enabling aggregation of information from these sources. In addition, the Registry provides a service to support on-demand "layering" of additional variant information from any additional source by enabling any party to publish URI links to additional information for any subset of variants, large or small. The URIs point to variant-specific content that is either human readable (HTML), machine readable (preferably RDF-serializable JSON-LD), or, ideally, both.
The URIs are constructed on-the-fly using the IETF URI template (RFC 6570) specific to the external information source. The source's URI template is filled using either the variant's CAid (preferred) or the expansion values the source associated with the CAid (Figure 3). To meet the needs of large external sources, the Registry APIs support bulk upload of the associations for already registered variants and for new variants upon their registration.
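The on-the-fly construction of links from a source's URI template can be sketched as follows. This is a minimal illustration of RFC 6570 simple string expansion (`{name}` only), not the Registry's code; the template URLs, the variable names `CAid` and `id`, and the expansion values are hypothetical.

```python
import re
from urllib.parse import quote

_VARIABLE = re.compile(r"\{([A-Za-z0-9_]+)\}")


def expand(template, values):
    """Fill each {name} in the template with its percent-encoded value.

    Implements only simple string expansion; a variable without a value
    expands to the empty string, as RFC 6570 specifies for undefined variables.
    """
    def substitute(match):
        name = match.group(1)
        if name not in values:
            return ""
        # safe="" leaves only unreserved characters unescaped
        return quote(str(values[name]), safe="")

    return _VARIABLE.sub(substitute, template)


# Hypothetical external source that keys its pages by CAid (the preferred case).
by_caid = "https://variants.example.org/allele/{CAid}"
print(expand(by_caid, {"CAid": "CA033016"}))
# https://variants.example.org/allele/CA033016

# Hypothetical source that uses its own identifier, stored as an expansion value.
by_local_id = "https://variants.example.org/view?id={id}"
print(expand(by_local_id, {"id": "var 12/7"}))
# https://variants.example.org/view?id=var%2012%2F7
```

Storing one template per source plus a few values per variant keeps the link data compact, since the full URI never has to be stored.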

Availability, licensing, and source code
The Allele Registry services (web app and APIs; https://reg.clinicalgenome.org) are freely available for public use. The source code is distributed under a GNU Affero GPL v3.0 license and is available at https://github.com/BRL-BCM.

Content of Registry databases
The Registry content is stored in the reference, allele, and link databases (their implementation is described in Sections 2.5-2.7, respectively). The reference database currently contains more than 500,000 reference nucleotide and amino acid sequences (summarized in Figure 4).

Allele Registry supports multiple types of variant query
The Registry web UI (https://reg.clinicalgenome.org) currently offers 11 variant query options, one of which ("HGVS") is illustrated in [...] The Allele Registry also supports querying by primary identifiers from multiple key sources, including ClinVar, ExAC, and dbSNP. Query results from the Registry include those primary identifiers, so the Registry can act as a cross-reference service; for example, ClinVar variation identifiers may be used to locate matching variants in ExAC and vice versa.
Although widely used, HGVS expressions do not provide a means of uniquely identifying a variant even in the context of the same reference sequence (Munz et al., 2015; Yen et al., 2017). Some of the current literature and a large part of the older literature include incomplete references to variants, consisting of a gene name and a partial "mutation" description without a transcript or genomic reference sequence identifier. This poses a problem for generating an HGVS expression or a variant description in VCF format. This incomplete allele identification is a critical issue because variant classification for rare diseases often relies on data contained in the medical literature (Richards et al., 2015). The Registry provides a web interface that helps identify and register such partially and informally defined variants.
For example, for variant descriptions lacking transcript identifiers, the interface lists all available transcripts that match a query consisting of a gene symbol and a partial HGVS expression (Figure 5b). The interface also generates variants that are not yet registered. Such variants may be immediately registered by a single click, and their CAids or HGVS expressions may subsequently be used for their unambiguous identification, for example when performing variant classification.

The Allele Registry provides rapid and convenient access to new variant identifiers
A query for an individual variant or for multiple variants in the batch mode may identify variants that are well defined but absent from the Registry. In this case, a user has an option to register the variants [...] interface. An interface dedicated to bulk registration of thousands of variants, using a list of HGVS expressions as input, is also available.
Query and registration of millions or more variants per batch are best accomplished via the Registry APIs, as we describe next.

Registry web APIs provide programmatic interoperability
The use of web APIs is the preferred interoperability method for regular and automated interactions with Registry services. The APIs are indispensable for query and registration of large numbers (millions) of variants: all variants, even from large resources such as dbSNP and MyVariant.info, can be queried or registered within 1 or 2 hr (Table 5).
The ClinGen Pathogenicity Calculator (Patel et al., 2017) and [...]
To fully support Registry usage by any external application, we implemented the Registry in an API-centric manner with Registry web UIs utilizing Registry functionality exclusively through its public HTTP REST-APIs. Through disciplined adherence to this approach, we ensured that all the functionality accessible to human users via the web interface is also accessible programmatically via the APIs. For maximal ease of use, APIs are designed to communicate using very simple and intuitive endpoints and the response is sent back using a standard Linked Data format (RDF-serializable JSON-LD) or an annotated VCF file format.
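A client-side interaction with such an API might look like the Python sketch below. The base URL comes from this paper; the `/allele` path, the `hgvs` query parameter, and the assumption that the JSON-LD `@id` ends in the CAid are illustrative guesses that should be checked against the Registry's API documentation. The sample response document is made up.

```python
import json
from urllib.parse import urlencode, urlsplit
from urllib.request import urlopen

BASE_URL = "https://reg.clinicalgenome.org"


def hgvs_query_url(hgvs, base_url=BASE_URL):
    """Build a lookup URL for one HGVS expression.

    ASSUMPTION: the '/allele' path and 'hgvs' parameter are illustrative.
    """
    return f"{base_url}/allele?{urlencode({'hgvs': hgvs})}"


def caid_from_jsonld(document):
    """Extract the CAid from a JSON-LD allele document.

    ASSUMPTION: '@id' holds the allele URI and its last path segment is the CAid.
    """
    path = urlsplit(document["@id"]).path
    return path.rstrip("/").rsplit("/", 1)[-1]


def fetch_allele(hgvs):
    """Perform the HTTP GET and decode the JSON-LD body (needs network access)."""
    with urlopen(hgvs_query_url(hgvs)) as response:
        return json.load(response)


print(hgvs_query_url("NM_000546.5:c.736A>C"))
# https://reg.clinicalgenome.org/allele?hgvs=NM_000546.5%3Ac.736A%3EC

sample = {"@id": "https://reg.example.org/allele/CA0000001"}  # made-up document
print(caid_from_jsonld(sample))  # CA0000001
```

Percent-encoding the HGVS expression matters in practice because characters such as `:` and `>` are not safe inside a query string.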

Integration of the Registry with variant-centric tools and databases via bidirectional links
Several variant-centric tools and databases currently interoperate with the Registry. ClinGen's Variant Curation Interface, CIViC (Griffith et al., 2017), MyVariant.info (Xin et al., 2016), and ClinVar register their variants; store the CAids locally within their databases; and provide click-through links to the Registry via the variant URIs embedded in their user interfaces (Figure 6). To enable linking in the other direction, Registry API services support on-demand linking of variant information from external sources (Figure 3). Any external source may import and manage links (URIs) to variant information that is hosted at its site. This mechanism enables "layering" of additional information about registered variants by the community. The contributed links are accessible both via the Registry web UI and programmatically via the APIs. The links are generated dynamically and point to either user-readable HTML or computer-readable content or, preferably, both. In contrast to the human-readable content that must be aggregated by human inspection, the machine-readable content may be aggregated programmatically (as illustrated in Figure 3) for consumption by computer applications such as variant curation tools.

Variant deduplication
One "side effect" of variant registration is deduplication of variants in the registered source. The canonicalization web service may therefore be used via the APIs to deduplicate variants in both public and private databases. To demonstrate this capability at scale, we employed the Allele Registry API to find duplicates in the dbSNP, ClinVar, and MyVariant.info databases. To accomplish this, for each of these databases, bulk queries with all database variants (as VCF files) were submitted using the stepwise process described in Supporting Information Sec- [...] curation performed by the NCBI staff of variants submitted to ClinVar before they provide a ClinVar identifier (exact numbers of duplicates in each database are in Table 5).

Mining linked variant information from ClinVar and ExAC to identify nucleotide variants that cause the same amino acid change while being subject to discordant pathogenicity assertions
[...] (Table 6 and Supporting Information Table S1). For example, for TP53 (NM_000546.5), variants c.736A>C and c.736A>T both result in p.Met246Leu; however, they are interpreted as uncertain significance and likely pathogenic, respectively. This set of variants warrants re-evaluation because variants that result in the same alternate sequence are not likely to show discordant pathogenicity in the absence of human-specific codon bias or alteration in splicing. Similarly, in the ExAC database, we found 32 sets of variants that result in the same alternate sequence where the lowest frequency is less than 1% and the highest frequency is above 5% (a detailed description is given in Supporting Information Section 1.5), which would result in different classification evidence codes for variant classification by the ACMG/AMP criteria (Richards et al., 2015). Supporting Information Table S2 summarizes these 32 sets of variants along with their allele frequencies.
Note: For simplicity, the likely pathogenic and pathogenic variants were combined, as were likely benign and benign. The full list of variants and assertions is found in Supporting Information Table S1.
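The kind of mining described in this section can be sketched in a few lines of Python: group nucleotide variants by the amino acid change they cause and keep the groups whose assertions disagree. This is an illustrative sketch, not the procedure of Supporting Information Section 1.5. The record layout and the second gene ("GENE2") with its variants are invented; only the TP53 rows follow the example in the text.

```python
from collections import defaultdict

# Combine "likely" categories with their definite counterparts before comparing.
COLLAPSE = {
    "likely pathogenic": "pathogenic",
    "pathogenic": "pathogenic",
    "likely benign": "benign",
    "benign": "benign",
    "uncertain significance": "uncertain significance",
}


def discordant_groups(records):
    """Map (gene, protein change) -> {nucleotide variant: collapsed assertion}
    for every protein change whose nucleotide variants disagree."""
    groups = defaultdict(dict)
    for record in records:
        key = (record["gene"], record["protein"])
        groups[key][record["nucleotide"]] = COLLAPSE[record["assertion"]]
    return {
        key: variants
        for key, variants in groups.items()
        if len(set(variants.values())) > 1
    }


records = [
    {"gene": "TP53", "nucleotide": "c.736A>C", "protein": "p.Met246Leu",
     "assertion": "uncertain significance"},
    {"gene": "TP53", "nucleotide": "c.736A>T", "protein": "p.Met246Leu",
     "assertion": "likely pathogenic"},
    # Invented rows: concordant once "likely benign" is collapsed into "benign".
    {"gene": "GENE2", "nucleotide": "c.10C>A", "protein": "p.Arg4Ser",
     "assertion": "benign"},
    {"gene": "GENE2", "nucleotide": "c.12G>T", "protein": "p.Arg4Ser",
     "assertion": "likely benign"},
]

for key, variants in discordant_groups(records).items():
    print(key, variants)
# ('TP53', 'p.Met246Leu') {'c.736A>C': 'uncertain significance', 'c.736A>T': 'pathogenic'}
```

The same grouping pattern would apply to the allele frequency comparison by replacing the assertion with a frequency and the disagreement test with thresholds on the minimum and maximum values.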

DISCUSSION
The Registry is accessible programmatically via well-documented web APIs in accordance with recently articulated FAIR ("Findable, Accessible, Interoperable, Reusable") principles (Wilkinson et al., 2016). Although the principles were originally defined with large experimental datasets in mind, they also apply more broadly to information and to the computable knowledge about subjects such as genetic variants. The ClinGen Allele Registry addresses multiple aspects of FAIRness, with an emphasis on the following two aspects of "Findability": (a) the requirement for globally unique identifiers and (b) the requirement for rich metadata (including alternate identifiers and identifiable combinations of attributes) to facilitate search and retrieval. The Registry also implements "Accessibility" via HTTP REST-APIs and "Interoperability" by providing variant information using JSON-LD and controlled vocabularies.
One important Registry feature is the support for linking to information about registered variants in external sources. This approach to data aggregation both parallels and complements the traditional data warehousing strategy (Bean & Hegde, 2016), exemplified by databases such as MyVariant.info (Xin et al., 2016) and wANNOVAR, a web server built on top of the ANNOVAR application (Chang & Wang, 2012; Wang, Li, & Hakonarson, 2010; Yang & Wang, 2015). The data warehousing strategy brings all the variant data to a single location through an "Extract-Transform-Load" (ETL) process. A key step in the ETL process is "deduplication" where (a) information gathered from disparate sources is recognized as pertaining to the same entity (same variant) and can therefore be aggregated, and (b) a locally unique variant identifier ("primary key") is assigned to index the aggregated informa- [...]
The Registry is designed both for individual users, such as a clinician or curator using the UI to unambiguously identify a single variant found in an article or test report prior to curation, and for automated clients, such as a genomics pipeline that annotates and/or registers millions of variants through the provided APIs.
Finally, although the Registry helps to overcome several problems associated with variant identity and canonicalization, it has a few limitations in its current form. First, because the current variant model assumes that variants are identical at the genome and transcript levels, the canonicalization fails when a substitution in the genome also affects splicing, causing inclusion or exclusion of an exon in the transcript, which is described as a deletion at the transcript level and a substitution at the genomic level. Second, an HGVS expression for an indel may represent a variation that can be split into two independent indels (or indels and substitutions). This is a special case of a variant that can be described as a set of variants within a haplotype. In its present form, the Registry does not explicitly model haplotypes and treats each distinct haplotype as a distinct variant. Finally, the Registry also assumes that each variant is described at the base pair level of resolution and does not support variants such as CNVs that may not be described at that level of precision. Continuous development in coordination with key stakeholders is in progress toward overcoming these limitations.
In summary, the Registry web services create an innovative nexus for effective exchange and aggregation of information about human genetic variants, thus catalyzing the emergence of a commons of variant data and knowledge required for the advancement of genome research and genomic medicine.

ACKNOWLEDGMENTS
The ClinGen consortium is funded by the National Human Genome