Human Mutation

Innovation Center for Biomedical Informatics, Georgetown University Medical Center, Washington, DC; Center for Bioinformatics and Computational Biology, University of Delaware, Newark, Delaware; European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, United Kingdom; Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland; Protein Information Resource, Georgetown Medical Center, Washington, DC

proteomic, and clinical communities should greatly inform studies on the functional consequences of variants. UniProtKB is being exploited for this purpose to a small extent (for example, PolyPhen-2 (Adzhubei, Jordan, & Sunyaev, 2013) incorporates information on protein active sites from UniProtKB), but it remains vastly underutilized. PolyPhen and other tools use a variety of structural and sequence conservation information to predict the effects of missense variants and have been incorporated into variant interpretation resources and commercial pipelines (Ioannidis et al., 2016; Kircher et al., 2014; McLaren et al., 2016; Nykamp et al., 2017; Qian et al., 2018; Shihab et al., 2013; Shihab et al., 2015). While these tools work very well for some well-studied genes (e.g., BRCA1, TP53), the results are less established for many others and improvements are needed (Guidugli et al., 2018; Karbassi et al., 2016; Mahmood et al., 2017; Qian et al., 2018; Tavtigian et al., 2018).
Genome browsers (Kent et al., 2002; Yates et al., 2016) provide an interactive graphical representation of genomic data. They use standard data file formats, which enables the import and integration of multiple independent studies, as well as an individual userʼs own data, through community track hubs (Raney et al., 2014). Here, we illustrate the utility of representing UniProtKB protein functional annotations at the genomic level via track hubs. Using specific biological examples and some larger scale comparisons, we demonstrate how this information can be combined with genomic annotations to interpret the effect of missense variants in disease-related genes and proteins.
Knowledge of a variantʼs disease associations is also important in evaluating its impact. Many resources, including UniProtKB and ClinVar (Landrum et al., 2018), provide disease-related information on variants. In UniProtKB, the majority of this information comes from the literature, although OMIM (OMIM, 2018) is also used, primarily as a source for disease names and descriptions and as a means of identifying relevant literature. ClinVar is an open database for the deposition of variants identified in clinical genome screens; the scientist submitting variant information is responsible for assigning a clinical significance class to individual variants following the ACMG clinical significance recommendations (Richards et al., 2015). A subset of ClinVarʼs variations, nonsynonymous SNPs that change a single amino acid, closely reflects UniProtKBʼs "Natural variants", which include polymorphisms, variations between strains, isolates, or cultivars, and disease-associated mutations (https://www.uniprot.org/help/variant), and which are mostly (~98%) single amino acid changes. We evaluated UniProt Natural variant annotation against equivalent annotations in colocated ClinVar SNPs and found significant synergy between the two resources.

| METHODS
Mapping UniProtKB protein sequences to their genes and genomic coordinates is achieved with a four-phase Ensembl import and mapping pipeline. The mapping is currently conducted for the UniProt human reference proteome with the GRCh38 reference sequence and also for Saccharomyces cerevisiae S288C with the sacCer3 reference sequence. Reference sequences are provided by Ensembl. We summarize the approach here. Additional details,

| Phase two: Calculation of UniProt genomic coordinates
Given the UniProt to Ensembl mapping, UniProt imports the genomic coordinates of every gene and of the exons within each gene. Also included are the 3′- and 5′-UTR offsets in the translation and the exon splice phasing.
With this collated coordinate data, UniProt calculates the portion of the protein sequence in each exon and defines the genomic coordinates for the amino acids at the beginning and end of each exon. This set of peptide fragments with exon identifiers and coordinates is stored as the basis for protein to genomic mappings in UniProt.
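The per-exon calculation described above can be sketched as follows. This is a minimal illustration and not the UniProt pipeline itself: the names `ExonFragment` and `map_exons_to_protein` are hypothetical, and the input is assumed to be the coding segments of each exon (UTRs already trimmed), given as 1-based inclusive genomic coordinates in transcript order.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ExonFragment:
    """Peptide fragment encoded by the coding part of one exon (hypothetical record)."""
    exon_id: str
    genome_start: int  # 1-based, inclusive
    genome_end: int    # 1-based, inclusive
    aa_start: int      # first amino acid with at least one base in this exon
    aa_end: int        # last amino acid with at least one base in this exon


def map_exons_to_protein(coding_exons: List[Tuple[str, int, int]]) -> List[ExonFragment]:
    """Assign an amino acid range to each coding exon segment.

    Assumes segments are listed in transcript order, so the running count of
    coding nucleotides gives the codon index regardless of strand.
    """
    fragments = []
    consumed = 0  # coding nucleotides seen in earlier exons
    for exon_id, start, end in coding_exons:
        length = end - start + 1
        aa_start = consumed // 3 + 1
        aa_end = (consumed + length - 1) // 3 + 1
        fragments.append(ExonFragment(exon_id, start, end, aa_start, aa_end))
        consumed += length
    return fragments
```

A codon split across a splice junction is reported in both neighbouring fragments, which is one simple way to reflect exon splice phasing; a production mapping would need to record the phase explicitly.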

| Phase four: UniProt BED and BigBed files
Converting protein functional information into its genomic equivalent requires standardized formats. The Browser Extensible Data (BED; UCSC, 2016a), a tab-delimited format, represents one format for displaying UniProtKB protein annotations on a genome browser.
MCGARVEY ET AL. | 695
The binary equivalent of the BED file is BigBed (Kent, Zweig, Barber, Hinrichs, & Karolchik, 2010). This format is more flexible in allowing additional data elements, which provides a greater opportunity to fully represent protein annotations, and it is one of the file formats used to make track hubs. A track hub is a web-accessible directory of files that can be displayed in track hub-enabled genome browsers (Raney et al., 2014). Hubs are useful because users need only the hub URL to load all the data into the genome browser. Moreover, a public registry for track hubs is now available (https://trackhubregistry.org/), allowing users to search for track hubs in one location and providing links to multiple genome browsers.
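To make the BED representation concrete, the sketch below writes one 12-column BED line for a protein feature that spans one or more exon blocks. It is an illustration under stated assumptions, not the code used to build the UniProt files: the function name `bed12_line` and the feature name are hypothetical, the thick region is simply set to the full feature extent, and the extra columns of the extended (BED detail/BigBed) files are not modelled. BED coordinates are 0-based and half-open.

```python
from typing import List, Tuple


def bed12_line(chrom: str,
               blocks: List[Tuple[int, int]],
               name: str,
               strand: str = "+",
               score: int = 0,
               rgb: str = "0,0,0") -> str:
    """Build a tab-delimited BED12 line from 0-based, half-open genomic blocks."""
    blocks = sorted(blocks)
    chrom_start = blocks[0][0]
    chrom_end = blocks[-1][1]
    block_sizes = ",".join(str(end - start) for start, end in blocks)
    # Block starts are given relative to chromStart, as the BED format requires.
    block_starts = ",".join(str(start - chrom_start) for start, _ in blocks)
    fields = [
        chrom, chrom_start, chrom_end, name, score, strand,
        chrom_start, chrom_end,  # thickStart, thickEnd (assumed: whole feature)
        rgb, len(blocks), block_sizes, block_starts,
    ]
    return "\t".join(str(field) for field in fields)


# Hypothetical feature spanning two exon blocks.
example = bed12_line("chrX", [(100, 110), (200, 211)], "example_feature")
```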
Using the protein genomic coordinates, together with additional feature-specific annotations from UniProtKB, the BED detail (UCSC, 2016b) and BigBed formatted files, as well as the files required for track hubs, are produced for the UniProtKB human reference proteome.
(Richards et al., 2015): Benign, Likely benign, Uncertain significance, Likely pathogenic and Pathogenic. There are a small number of additional disease-related assertions in ClinVar such as "risk factor" and "drug response," which we classified as "other" in our analysis. All of the ClinVar assertions in the "other" category that aligned with UniProt annotations were "drug response" assertions. We only used variants with 1-4 stars and removed all 1-star variants with conflicting interpretations and those with no associated phenotype. We equated ClinVar assertions to UniProt classifications as follows: all "pathogenic" assertions (Pathogenic and Likely pathogenic) to "Disease" in UniProt; "Uncertain significance" to "Unclassified"; and all "benign" assertions (Benign and Likely benign) to "Polymorphism."

| RESULTS

| Usage
The BED tab-delimited files are useful to extract genome locations and annotation for data integration and computational analysis similar to that described below for mapping to ClinVar SNPs.
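As a rough sketch of this kind of data integration, the snippet below reads BED text and reports which features contain a given variant position. The function names are hypothetical, only the first four BED columns are used, and exon block structure is ignored for brevity, so a position inside an intron of a multi-block feature would also be reported.

```python
from typing import List, Tuple

Feature = Tuple[str, int, int, str]  # chrom, start (0-based), end (exclusive), name


def parse_bed(text: str) -> List[Feature]:
    """Parse the first four columns of tab-delimited BED text."""
    features = []
    for line in text.splitlines():
        # Skip blank lines and browser/track header lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        cols = line.split("\t")
        name = cols[3] if len(cols) > 3 else ""
        features.append((cols[0], int(cols[1]), int(cols[2]), name))
    return features


def features_at(features: List[Feature], chrom: str, pos_1based: int) -> List[str]:
    """Return names of features whose interval contains a 1-based position."""
    pos0 = pos_1based - 1  # convert to the 0-based coordinate system used by BED
    return [name for c, start, end, name in features
            if c == chrom and start <= pos0 < end]
```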
However, we recommend using track hubs and not the BED text files on genome browsers. The extended BED format is not

| Biological examples
To illustrate the utility of combining UniProt protein feature annotations and variation annotations to determine a probable mechanism of action, we looked at two well-studied disease-
In these examples, we looked at the annotation of individual variants manually but, as we illustrate below, our alignment of genome and protein variant annotations can be applied to larger scale analyses.
Table S1), indicating that the resulting disruption of protein structure is very likely to be harmful.
In comparison, variants that co-locate at single carbohydrate/glycosylation sites are tolerated best (less than 10% pathogenic assertions). The two types of features with the next highest proportion of pathogenic SNPs are Initiator methionine and Intramembrane region. Initiator methionine variants alter the initial methionine of a protein sequence, which is believed to result in the loss of protein translation. The Intramembrane region feature describes a sequence of amino acids entirely within a membrane but not crossing it.

| Comparison of ClinVar SNPs and UniProtKB natural variant annotation
To survey genomic and protein annotations on variants we compared
Abbreviation: SNP, single nucleotide polymorphism
Unclassified and Polymorphism with ClinVarʼs ACMG/AMP-based assertions of "Pathogenic or Likely pathogenic," "Uncertain Significance" and "Benign or Likely benign." The comparison in Table 2 covers a subset of the colocated SNPs, representing 35% of the total variants mapped, because all ClinVar SNPs with 0 gold star evidence levels and some 1-star SNPs were excluded (see Methods). The table shows that there is general agreement among similar annotations between the databases, with 86% of UniProtKB disease-associated variants mapping to "pathogenic" SNPs in ClinVar and 10% falling into the middle "Uncertain Significance" category. The remaining 4% fall mainly into the benign category. UniProtʼs "Polymorphism" category is closest in meaning to the "Benign" categories in ClinVar; here, again, there is 85% agreement. Of the remaining 15% of "Polymorphism" variants, 11% match the "Uncertain Significance" category in ClinVar, 3% are classified as "pathogenic" in ClinVar and 1% as "drug response." UniProtʼs "Unclassified" category is closest in meaning to ClinVarʼs "Uncertain Significance"; these are "gray" areas in each classification system and, as such, the agreement between the two databases is lowest: only 54% align and the rest are split between "pathogenic" and "benign" in ClinVar. The large number of variants annotated with "Uncertain Significance" status is currently a general problem in the field (Hoffman-Andrews, 2017). In the ACMG/AMP framework, uncertain occupies a middle ground between benign and pathogenic. Often there is some evidence of a functional defect or harmful effect, but it does not rise to clinical relevance or there is conflicting evidence.
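The agreement percentages discussed above follow from a simple cross-tabulation once ClinVar assertions are equated to UniProt classes as stated in the Methods. The sketch below shows that calculation. The function names are hypothetical and the input pairs are made-up toy data, not values from Table 2.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Equivalences stated in the Methods; anything else falls into "other".
CLINVAR_TO_UNIPROT = {
    "Pathogenic": "Disease",
    "Likely pathogenic": "Disease",
    "Uncertain significance": "Unclassified",
    "Benign": "Polymorphism",
    "Likely benign": "Polymorphism",
}


def equate(clinvar_assertion: str) -> str:
    """Map a ClinVar assertion to the equivalent UniProt class, or 'other'."""
    return CLINVAR_TO_UNIPROT.get(clinvar_assertion, "other")


def agreement(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Percent of colocated variants per UniProt class whose ClinVar assertion agrees."""
    totals, agree = Counter(), Counter()
    for uniprot_class, clinvar_assertion in pairs:
        totals[uniprot_class] += 1
        if equate(clinvar_assertion) == uniprot_class:
            agree[uniprot_class] += 1
    return {cls: 100.0 * agree[cls] / totals[cls] for cls in totals}


# Toy input for illustration only.
toy_pairs = [
    ("Disease", "Pathogenic"),
    ("Disease", "Uncertain significance"),
    ("Polymorphism", "Benign"),
    ("Unclassified", "drug response"),
]
toy_result = agreement(toy_pairs)
```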
Though annotations in UniProt and ClinVar are in general agreement, there is still a significant level of disagreement between the databases, which is similar to that seen in recent analyses that
Pharmacogenomic variants may also be classified differently by protein curators and medical geneticists. Our colocated data set in Table 2 has 52 variations with a "drug response" annotation in

| Comparison of literature citations
Positional mappings also allow comparison of the literature cited as evidence for the annotated assertions. We compared all PMIDs cited as evidence for the colocated ClinVar and UniProt variants (the same set that was used for
Also, some ClinVar citations concern curation methods rather than the specific gene or variation. In UniProt, the missing PMIDs are an error. All the Natural Variants in the Swiss-Prot section were curated from the literature cited in the entry. However, the link from the amino acid sequence to the publications is missing for some older and high-throughput publications. Curation and data management practices were changed years ago to solve this problem, but not all PMID links have been recovered.
Looking again at our biological example, the GLA gene, 28 UniProt and ClinVar variants overlap and 27 of them agree on classification: 26 are classified as "Pathogenic/Likely pathogenic" by ClinVar and as Disease-associated by UniProtKB, and one as "Unclassified" in UniProt and as "Uncertain significance, drug response" in ClinVar. Of the 28 common variations, ClinVar has no citations for ten and UniProtKB is missing citations for two. There is a total of 91 unique PMIDs in the combined GLA set: 81 from ClinVar and 10 from UniProtKB.
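A citation comparison of this kind reduces to set operations on PMIDs. The following sketch shows one way to summarize it; the function name `compare_citations` is hypothetical and the identifiers in the example are placeholders, not PMIDs from the GLA set.

```python
from typing import Dict, Iterable


def compare_citations(clinvar_pmids: Iterable[str],
                      uniprot_pmids: Iterable[str]) -> Dict[str, int]:
    """Summarize the overlap between two collections of PMIDs."""
    cv, up = set(clinvar_pmids), set(uniprot_pmids)
    return {
        "total_unique": len(cv | up),   # PMIDs cited by either resource
        "shared": len(cv & up),         # PMIDs cited by both
        "clinvar_only": len(cv - up),
        "uniprot_only": len(up - cv),
    }


# Placeholder identifiers for illustration only.
summary = compare_citations(["1", "2", "3"], ["3", "4"])
```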

| CONCLUSION
Exome sequencing for clinical diagnosis is becoming more common and usually uncovers many non-synonymous SNP variations of unknown significance (VUS). Distinguishing which, if any, of these variants could be causal is difficult. Protein annotation can aid in variant curation by providing a functional explanation for a variantʼs effect, which is one of several important evidence categories used in predicting the severity of variants (Nykamp et al., 2017; Richards et al., 2015). An accurate mapping between protein and genome annotation allows for more detailed analysis of the effects of
The global comparison of variant classification between UniProtKB and ClinVar in Table 2 and the comparison of literature citations for variants between the two public databases were also informative. There is general agreement on the classification of variants between genome and proteome even though priorities and terminology have been different. However, both comparisons illustrate that the classification of variations in enzymatic activities related to drugs needs better standardization. Many clinically relevant somatic variants found in tumors may need to be handled in a similar manner to "drug response" variants, because they confer sensitivity or resistance to a treatment regime (Boca, Panagiotou, Rao, McGarvey, & Madhavan, 2018; Li et al., 2017; Madhavan et al., 2018; Ritter et al., 2016). Thus, the "pathogenic/benign" terminology might not be appropriate for all cases.
The work described here provides the basis for a re-evaluation of UniProtKB annotation and the further standardization of this annotation with ClinVar and ClinGen. A detailed evaluation in which UniProt curators are performing a systematic re-curation of a randomly chosen set of variants from UniProt and ClinVar using the ACMG guidelines is being completed (M. Famiglietti et al., ).
Recent efforts in the medical community to standardize the methods and levels of evidence required for the annotation of genetic variants (Amr et al., 2016; Manrai et al., 2016; O'Daniel et al., 2016; Richards et al., 2015; Walsh et al., 2016), along with increasing amounts of population data (Amr et al., 2016; Walsh et al., 2016), are leading to the widespread re-evaluation of previous assertions of pathogenicity.
UniProtKB features have been mapped to the genome before, as the UCSC genome browser has provided selected UniProtKB/Swiss-Prot features for several years. The mappings described here contain additional annotation beyond that previously available and include isoform sequences from Swiss-Prot and sequences and features from the TrEMBL section of UniProtKB. The data files and track hubs will be updated with each release of UniProtKB, making any new annotation available immediately. The 34 features currently provided are not all of the positional annotations in UniProtKB, and we may add additional features in future releases. We plan to extend genome mapping to other model organisms. UniProt is working with the UCSC and Ensembl browser teams to improve the presentation of protein annotation on the respective browsers. In addition, some of the data provided here are available programmatically via a REST API (Nightingale et al., 2017). UniProt also collaborates with ClinVar to provide reciprocal links between variants that exist in both databases.
In summary, linking annotated data with assertions, publications and other evidence from UniProtKB, ClinVar or other datasets via co-location on the genome, as we demonstrate here, should help to better integrate protein and genomic analyses and improve interoperability between the genomic and proteomic communities to better determine the functional effects of genome variation on proteins. The location of a variant within functional features may correlate with pathogenicity and would be a useful attribute for use in variant prediction algorithms, including machine-learning approaches. We hope to investigate this and related topics in the future, and as a publicly funded resource, UniProt encourages others to further analyze our data as well.

| Data access
The extended BED text files and binary BigBed files used for genome edu/cgi-bin/hgHubConnect?hubSearchTerms=uniprot) and the Ensembl genome browser (Aken et al., 2016;Hubbard et al., 2007)

CONFLICT OF INTEREST
The authors declare that there is no conflict of interest.

ETHICAL COMPLIANCE
Patient clinical data have been obtained in a manner conforming with IRB and/or granting agency ethical guidelines.