Annotating and prioritizing genomic variants using the Ensembl Variant Effect Predictor—A tutorial

The Ensembl Variant Effect Predictor (VEP) is a freely available, open-source tool for the annotation and filtering of genomic variants. It predicts variant molecular consequences using the Ensembl/GENCODE or RefSeq gene sets. It also reports phenotype associations from databases such as ClinVar, allele frequencies from studies including gnomAD, and predictions of deleteriousness from tools such as Sorting Intolerant From Tolerant and Combined Annotation Dependent Depletion. Ensembl VEP includes filtering options to customize variant prioritization. It is well supported and updated roughly quarterly to incorporate the latest gene, variant, and phenotype association information. Ensembl VEP analysis can be performed using a highly configurable, extensible command-line tool, a Representational State Transfer application programming interface, and a user-friendly web interface. These access methods are designed to suit different levels of bioinformatics experience and meet different needs in terms of data size, visualization, and flexibility. In this tutorial, we will describe performing variant annotation using the Ensembl VEP web tool, which enables sophisticated analysis through a simple interface.

application of predictive algorithms to evaluate the impact of change at the locus. Appropriate resources are now available to facilitate variant interpretation. These include reference gene sets that are regularly updated; assertions of genotype-phenotype association, which continue to grow in the key databases and literature; population frequency studies, which are expanding to include more individuals and report more detailed catalogs of rare variants; and variant pathogenicity prediction, which is an active area of tool development.
In the Ensembl Project (Howe et al., 2021) we create high-quality gene sets, predict genomic regions involved in gene regulation, and collate large-scale sets of variant and phenotype association data.
Ensembl VEP (McLaren et al., 2016) builds on these resources and integrates results from variant assessment algorithms to enable convenient but extensive variant annotation. We provide regular updates, approximately every 3 months, to both the VEP software and associated data to ensure the latest information can be used for analysis. Here we present a tutorial describing the Ensembl VEP web interface, detailing the available analysis options and filters.

| DATA INPUT
Navigate to the Ensembl VEP homepage by clicking on the "VEP" link in the blue navigation bar on the Ensembl homepage (https://www.ensembl.org/index.html). The Ensembl VEP homepage links to the three different VEP interfaces and detailed documentation. Click on "Launch VEP" to open the web form, which is divided into sections for data input and optional analysis configuration (Figure 1).
The human GRCh38 assembly is selected by default, but a link provides access to a GRCh37 dedicated tool. To make the management of multiple analyses simpler, a name can be assigned to each job.
FIGURE 1 The Ensembl VEP web interface showing species/assembly selection, data input, transcript set selection, and additional groups of configuration options
Data can be input by (1) pasting into the text box, (2) uploading a file, or (3) providing a URL for a file on a public server. The text box is suitable for small-scale datasets. To analyze a larger dataset, provide a URL or use the file upload option, which supports a maximum file size of 50 megabytes (or around two million lines in a compressed variant call format [VCF] file).
VCF is the standard exchange format used in next-generation sequencing pipelines so Ensembl VEP is optimized to analyze variants in this format.
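As a minimal sketch of preparing input for the text box or file upload, the Python below writes a handful of variants as VCF lines and checks the result against the 50-megabyte upload limit. The helper names (`to_vcf`, `fits_upload_limit`) and the variant coordinates are hypothetical, and "megabyte" is assumed here to mean 10^6 bytes.

```python
from typing import Iterable, Tuple

# chrom, 1-based position, identifier, reference allele, alternate allele
Variant = Tuple[str, int, str, str, str]

# Assumption: the 50 megabyte limit is interpreted as 50 * 10^6 bytes.
MAX_UPLOAD_BYTES = 50 * 1000 * 1000


def to_vcf(variants: Iterable[Variant]) -> str:
    """Return minimal VCF text: a format line, a column header, one row per variant."""
    lines = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    # Sort by chromosome name, then position, so rows appear in coordinate order.
    for chrom, pos, var_id, ref, alt in sorted(variants, key=lambda v: (v[0], v[1])):
        lines.append("\t".join([chrom, str(pos), var_id, ref, alt, ".", ".", "."]))
    return "\n".join(lines) + "\n"


def fits_upload_limit(vcf_text: str, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """True if the encoded text is no larger than the upload limit."""
    return len(vcf_text.encode("utf-8")) <= limit


# Hypothetical variants, for illustration only.
example_variants = [
    ("11", 5000200, ".", "C", "T"),
    ("11", 5000100, ".", "A", "G"),
]
vcf_text = to_vcf(example_variants)
```

Text produced this way can be pasted into the input box; a larger file that fails the size check would need to be supplied by URL instead.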
Further options for selecting the reference transcript set as well as retrieving additional annotations including related identifiers, allele frequencies, pathogenicity predictions, and phenotype annotations can be found in the expandable panels and will be explored in more detail below.

| TRANSCRIPT SET SELECTION
Predicting the molecular consequence of a genomic variant is an essential step in interpretation and requires extensive, accurate gene annotation. There are two commonly used human gene sets: Ensembl/GENCODE (Frankish et al., 2021) and RefSeq (O'Leary et al., 2016). The two sets are generated using similar but not identical evidence and algorithms, and so differ slightly. VEP can analyze variants using either gene set, the two combined, or GENCODE Basic (which contains a small subset of representative transcripts for each gene). Select your preference in the "Transcript database to use" section (Figure 1).
The Ensembl VEP algorithm compares each variant to each transcript in the selected set and reports the relative transcript location of the variant (e.g., exonic, upstream) with any predicted molecular consequence (e.g., missense, frameshift). Consequences are described using Sequence Ontology terms (SO; Cunningham et al., 2015) to enable comparison and integration with results from other systems.
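To make the idea of a predicted molecular consequence concrete, the toy sketch below classifies a single-codon change using the standard genetic code and returns a Sequence Ontology term name. It is an illustration only, not the Ensembl VEP algorithm: it ignores transcript structure, strand, start codons, and indels, and the function name `classify_codon_change` is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Return an SO term name for a substitution within one codon (toy model)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        # Same translation: a stop stays a stop, or an amino acid is unchanged.
        return "stop_retained_variant" if ref_aa == "*" else "synonymous_variant"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "missense_variant"
```

For example, `classify_codon_change("GAG", "GTG")` swaps glutamate for valine and is classified as a missense variant, whereas `classify_codon_change("GAG", "GAA")` leaves the amino acid unchanged.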

| Transcript-related identifiers
Gene symbols assigned by the HUGO (Human Genome Organisation) Gene Nomenclature Committee (HGNC), versioned transcript accessions, and transcript types (e.g., AGT, ENST00000366667.6, protein-coding, respectively) are returned by default. Use the "Identifiers" section (Figure 2) to add further information, including Ensembl or RefSeq protein identifiers, UniProt protein accessions, and HGVS variant descriptions at protein and transcript level to your output.

| Frequencies and citations
With over seven hundred million variants in dbSNP (version 154, May 2020) alone, the majority of variants found in an individual will have already been described. This information can be crucial to
FIGURE 2 The "Identifiers" section, which allows the selection of gene, protein, and HGVS identifiers
HUNT ET AL.
For fully open databases, variants are matched by allele, and key additional information is reported. By default, we only report matches to variants passing our quality filtering (e.g., those mapping to multiple genomic locations are excluded); to include all variants in the search, check the "Include flagged variants" option.
In rare disease studies, it is useful to filter out variants using reference population frequencies, as variants common in the general population are less likely to be causative. Use the "Variants and
The American College of Medical Genetics and Genomics (ACMG) guidelines (Richards et al., 2015) treat an allele frequency above 5% as stand-alone evidence that a variant allele is not pathogenic. For a single causative variant, ACMG recommends that frequency filters be set higher than the disease prevalence. Filter cut-offs should be higher if it is possible that multiple variants are acting together.
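The frequency reasoning above can be sketched in a few lines of Python. This is a minimal illustration, not part of Ensembl VEP: the function names and the dictionary layout are hypothetical, missing frequencies are assumed to mean "not observed", and the rs334 values are the continental 1000 Genomes frequencies quoted in the case study later in this tutorial.

```python
from typing import Dict, Optional

# 5% allele frequency, the ACMG stand-alone threshold cited in the text.
ACMG_STANDALONE_AF = 0.05


def max_population_af(pop_freqs: Dict[str, Optional[float]]) -> float:
    """Highest allele frequency across populations; missing values count as 0."""
    observed = [af for af in pop_freqs.values() if af is not None]
    return max(observed, default=0.0)


def is_rare(pop_freqs: Dict[str, Optional[float]],
            cutoff: float = ACMG_STANDALONE_AF) -> bool:
    """True if the variant is below the cutoff in every population."""
    return max_population_af(pop_freqs) < cutoff


# Continental frequencies for rs334 as reported in the case study.
rs334 = {"AFR": 0.0998, "AMR": 0.0072, "EAS": 0.0, "SAS": 0.0, "EUR": 0.0}

# Hypothetical variant with no frequency data in any population.
unseen = {"AFR": None, "AMR": None, "EAS": None, "SAS": None, "EUR": None}
```

With these inputs, `is_rare(unseen)` is true, while `is_rare(rs334)` is false because the African continental frequency exceeds 5%. A blanket cutoff applied to the maximum population frequency would therefore discard rs334 despite its ClinVar disease associations, which is one reason cut-offs should be chosen with the disease and population in mind.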
Select the "Variant synonyms" option to display the names of variants in databases such as ClinVar, UniProt, and PharmGKB. In your results, the names will be linked to the relevant entries in the source databases, so the details held in these resources can be examined. Check the "PubMed identifiers" button to return a list of any publications describing the variant with links to full-text resources where available. Variant citations are imported from a number of sources including manually curated records. These can contain occasional incorrect results which cannot be filtered out computationally, though errors are usually obvious on review. Citation and synonym information is matched on variant name or location and is not allele-specific.

| Transcript selection
Transcriptomic sequencing from multiple tissues has resulted in the annotation of increasing numbers of transcript isoforms for many genes. Assessing large numbers of predictions for each variant is time-consuming but important to ensure no information is missed. To support downstream filtering VEP reports transcript type (such as protein-coding or pseudogene) and, for Ensembl transcripts, two prioritization metrics. Transcript Support Level (TSL) summarises the amount of evidence supporting a transcript into a numeric score.
APPRIS (Rodriguez et al., 2018) identifies principal transcript isoforms for genes in vertebrate species using protein structural information, functionally important residues, and evidence from cross-species alignments. These options are listed in the "Transcript annotation" section and are reported in Ensembl VEP results by default (Figure 4).

| Protein domains
When a variant maps to the protein, understanding which domain it falls in can provide clues as to its possible impact on function. InterPro is an integrated resource for protein families, domains, and sites, combining information from several different protein signature databases (Blum et al., 2021). We run InterProScan

| Regulatory elements
FIGURE 4 The "Additional annotations" section, which allows the selection of transcript, protein domain, regulatory region, and phenotype annotations
Variants in the noncoding regions of the genome are more difficult to interpret than those falling within genes and are also important in disease (Zhang & Lupski, 2015). In the Ensembl project, we use data from large-scale projects including Encyclopedia of DNA Elements (ENCODE), the International Human Epigenome Consortium (IHEC), and Blueprint, to predict regions in the human genome that influence gene regulation. We classify them into types such as "promoter" and "enhancer" (Zerbino et al., 2015). Select the "Get regulatory region consequences" option (Figure 4) to identify where your variants overlap such regions. This analysis can be configured to report all results or only those from specific cell types.

| Phenotype and disease associations
Access to phenotype or disease associations previously reported for your variants or the genes they overlap is essential. There is a large body of information available in different databases but which is free to access, with registration.

| Prediction packages
An increasing number of pathogenicity scoring algorithms are being developed to aid variant interpretation. It must, however, be remembered that predictions often use the same training sets and/or evidence, so agreement between two algorithms does not necessarily provide additional evidence for a rating. We calculate scores for all possible amino acid substitutions in all Ensembl proteins using SIFT (Kumar et al., 2009) and PolyPhen-2 (Adzhubei et al., 2010). These results are returned by default. dbNSFP, the database for nonsynonymous SNPs' functional predictions (Liu et al., 2020), contains pre-calculated scores for over 20 algorithms. Select this option (Figure 5) to browse the "Fields to include" menu and configure the precise results set to be returned.
CADD (Rentzsch et al., 2019) is a framework for scoring the deleteriousness of genomic variants using a wide range of different information including conservation, functional information, and protein level pathogenicity predictions. Select this option to view scores for variants in both coding and noncoding loci.
Variants that disrupt splicing have also been implicated in human disease (Ward & Cooper, 2010). We optionally report results from the well-established MaxEntScan (Yeo & Burge, 2004); SpliceAI (Jaganathan et al., 2019), which takes a machine learning approach; and the ensemble scores provided in the dbscSNV (Liu et al., 2020) database. Select these options in the "Splicing predictions" section (Figure 5).
Considering sequence constraint and conservation can help interpret how well a sequence change at a particular location may be tolerated. Catalogs of variants in dense population samples have enabled improved estimation of selection against changes that result in loss of function for a gene. Scores from one such algorithm, LoFtool (Fadista et al., 2016), are available. We also optionally report BLOSUM62 (Henikoff & Henikoff, 1992) scores for missense variants as a classic measure of the impact of changing one amino acid for another. In Ensembl, we infer genome-wide ancestral sequences (Paten et al., 2008) for different groups of species. Select the "Ancestral Allele" option (Figure 5) to obtain the ancestral allele predicted from the alignment of 12 primate species, including Homo sapiens.

| Filtering and advanced options
The options in these sections will not be required for the majority of analyses. The "Filters" section (Figure 6) allows the results returned to be restricted by allele frequency, to contain only variants in the coding sequence, or to be reduced to a subset of the available variant-transcript combinations. However, we recommend instead filtering results after the analysis, which allows for greater flexibility.
The "Advanced options" allow you to change the way VEP analyses variants internally (a smaller batch size will reduce memory requirements but increase run time) and control whether insertions and deletions in repetitive sequence are expressed at their most 3' position before consequence evaluation.
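The effect of expressing an insertion or deletion at its most 3' position can be illustrated with a small sketch. The code below is not the Ensembl VEP implementation; it assumes the forward genomic strand (for a reverse-strand transcript the 3' direction is reversed), uses 0-based coordinates, and the function names are hypothetical.

```python
from typing import Tuple


def shift_deletion_3prime(ref: str, start: int, length: int) -> int:
    """Return the right-most 0-based start that yields the same sequence
    after deleting `length` bases from `ref`."""
    # Sliding the deleted window one base right is equivalent whenever the
    # base leaving the window equals the base entering it.
    while start + length < len(ref) and ref[start] == ref[start + length]:
        start += 1
    return start


def shift_insertion_3prime(ref: str, pos: int, inserted: str) -> Tuple[int, str]:
    """Return (pos, inserted) moved as far right as possible.
    `pos` is the 0-based index of the reference base the insertion precedes;
    `inserted` must be non-empty."""
    # Moving the insertion past a base equal to its first base rotates the
    # inserted sequence by one position and leaves the final sequence unchanged.
    while pos < len(ref) and ref[pos] == inserted[0]:
        inserted = inserted[1:] + inserted[0]
        pos += 1
    return pos, inserted
```

For instance, deleting "CA" from the repeat in "GCACACAT" gives "GCACAT" whether the deletion starts at index 1, 3, or 5; `shift_deletion_3prime("GCACACAT", 1, 2)` returns 5, the most 3' of these equivalent descriptions.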

| RESULTS
Having configured your analysis, click the "Run" button at the bottom of the form. Analysis jobs run on our compute farm and the time required will depend on the number of input variants and range of options chosen. The "Recent jobs" table displays the status of all your analyses and has options to edit and resubmit, share or discard jobs.
Results can be saved by logging into an Ensembl account. Once a job has the status of "Done," clicking on "View Results" will display the results table.
Summary statistics and charts display an overview of the results on the output page (Figure 7). There is also a table with a preview of the detailed results and a simple interface to configure filtering of the output. To aid variant prioritization, multiple filters can be combined using basic logical relationships, allowing the creation of complex customized queries. For example, "Consequence is protein_altering_variant" plus "CADD PHRED >=30" plus "gnomAD AF is not defined" will report variants that are predicted to change protein sequence, are in the 0.1% most deleterious changes predicted by CADD, and are not seen in the gnomAD exome variant set. Importantly, we report the most specific SO term but enable querying by parent terms. For example, when the consequence of "protein-
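The combined query described above can be mimicked on downloaded results with a short Python sketch. Everything here is illustrative: the dictionary keys and example rows are hypothetical, and the set of child terms is a small hand-written subset assumed to sit under protein_altering_variant, not the full Sequence Ontology. The helper `phred_to_top_fraction` shows why a PHRED-scaled score of 30 corresponds to the top 0.1%, since the fraction is 10^(-PHRED/10).

```python
from typing import Dict, List, Optional, Union

Row = Dict[str, Union[str, Optional[float]]]

# Assumption: illustrative subset of terms treated as children of
# protein_altering_variant, to mimic querying by a parent term.
PROTEIN_ALTERING_TERMS = {
    "protein_altering_variant",
    "missense_variant",
    "frameshift_variant",
    "inframe_insertion",
    "inframe_deletion",
}


def phred_to_top_fraction(phred: float) -> float:
    """Fraction of changes ranked at or above a PHRED-scaled score."""
    return 10 ** (-phred / 10)


def passes(row: Row) -> bool:
    """Protein-altering AND CADD PHRED >= 30 AND gnomAD AF not defined."""
    terms = str(row["Consequence"]).split(",")
    is_protein_altering = any(term in PROTEIN_ALTERING_TERMS for term in terms)
    cadd = row.get("CADD_PHRED")
    has_high_cadd = cadd is not None and float(cadd) >= 30
    absent_from_gnomad = row.get("gnomAD_AF") is None
    return is_protein_altering and has_high_cadd and absent_from_gnomad


# Hypothetical result rows, for illustration only.
rows: List[Row] = [
    {"ID": "var1", "Consequence": "missense_variant", "CADD_PHRED": 32.0, "gnomAD_AF": None},
    {"ID": "var2", "Consequence": "missense_variant", "CADD_PHRED": 32.0, "gnomAD_AF": 0.01},
    {"ID": "var3", "Consequence": "synonymous_variant", "CADD_PHRED": 31.0, "gnomAD_AF": None},
    {"ID": "var4", "Consequence": "frameshift_variant", "CADD_PHRED": 12.5, "gnomAD_AF": None},
]
selected = [row["ID"] for row in rows if passes(row)]
```

Only the first hypothetical row satisfies all three conditions; each of the others fails exactly one of them.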

| Structural variants
FIGURE 5 The "Predictions" section, which allows the selection of different pathogenicity, splicing, and conservation predictions
Ensembl VEP is currently optimized for the annotation of short variants; however, basic annotation of structural variants (SVs) with defined boundaries within a chromosome is supported. SVs should be input using VCF format, and the length of the variant must be derivable from either the "END" or "SVLEN" key. All transcripts an SV overlaps are reported, and SO consequence terms are assigned to report whether the variant results in the deletion or duplication of part or all of the transcript. Due to the longer genomic regions involved, analysis of SVs is slower and more memory-intensive than for short variants, and it is advisable to reduce buffer size using the "Advanced Options" for more efficient analysis.
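A quick pre-submission check that each SV record has a derivable length might look like the sketch below. It is an assumption-laden illustration rather than Ensembl VEP's own logic: the function names are hypothetical, SVLEN is taken as an absolute value, and END minus POS is used as the length under the common VCF convention that POS is the base preceding the event.

```python
from typing import Dict


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO string into key/value pairs; flags map to ''."""
    fields: Dict[str, str] = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def sv_length(pos: int, info: str) -> int:
    """Derive SV length from SVLEN if present, otherwise from END and POS."""
    fields = parse_info(info)
    if "SVLEN" in fields:
        # SVLEN may be negative for deletions and may list one value per allele;
        # only the first value is used in this sketch.
        return abs(int(fields["SVLEN"].split(",")[0]))
    if "END" in fields:
        return int(fields["END"]) - pos
    raise ValueError("SV length cannot be derived: INFO has neither END nor SVLEN")
```

Records for which `sv_length` raises an error lack both keys and would need to be corrected before submission.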

| CASE STUDY
A region of chromosome 11 has been sequenced for a number of samples. A VCF file containing information about the identified variants has been produced through a process of alignment and variant calling using the GRCh38 reference genome assembly. Ensembl VEP has been used to annotate the variants with information about allele frequencies reported in the 1000 Genomes project and phenotype annotations. The input VCF file and the output files in VCF and VEP format can be found in the supplementary materials.
To produce these output files, run the web-based VEP tool using the provided input data with default settings as well as selecting the options to retrieve the 1000 Genomes continental allele frequencies and overlapping phenotype annotations.
When viewing the Ensembl VEP output in the web browser, the summary table shows that the data set contains 15 variants that overlap three genes and one regulatory feature.
In the full results table, the first set of columns reports information about the variants and the features they overlap. Where the feature is a transcript, you will find the gene symbol and stable ID and the transcript stable ID and biotype. In this data set, all variants overlap the transcripts of the HBB gene.
The predicted effects on transcripts can be found in subsequent columns, including the position of the variant in terms of the exon number, cDNA, CDS and protein, the amino acid and codon change, as well as transcript flags, such as MANE, which can be used in variant prioritization and reporting, and pathogenicity scores. The pathogenicity scores are shown as numbers with colored highlights to indicate the prediction.
Where the variant is known, its identifier is listed in the "Existing variant" column, with a link out to the variant page within Ensembl. In this example, identifiers from dbSNP, COSMIC, and/or HGMD can be found for each variant.
By default, Ensembl VEP also reports the 1000 Genomes project global allele frequency (AF in the table). In our query, we also selected the option to retrieve allele frequencies from the 1000 Genomes continental populations and these data are displayed in separate columns.
To illustrate filtering, we will look for variants that are not rare in a 1000 Genomes population but do have associations with disease reported in ClinVar. Find variants that are observed at a frequency greater than 5% in the African continental population of the 1000 Genomes project by selecting a filter of "AFR AF >0.05" and clicking "Add." Find variants with information in ClinVar by entering a filter of "Clinical Significance is defined" and clicking "Add." In this query, the variant with ID rs334 has an alternative allele frequency of 0.0998 in the African continental population and 0.0072 in the American continental population, but 0 in the East Asian, South Asian, and European continental populations (Figure 9). In subsequent columns, the ClinVar clinical significance and the phenotypes associated with known variants or with the genes affected by the variants are reported. In this query, the variant with ID rs334 is described as both protective and pathogenic, and is associated with 39 phenotypes including malaria, anemia, and beta-thalassemia (Figure 10).

CONFLICT OF INTERESTS
Paul Flicek is a member of the scientific advisory boards of Fabric Genomics, Inc., and Eagle Genomics, Ltd.

DATA AVAILABILITY STATEMENT
No new data were created or analyzed in this study.
Publicly available data is integrated into the Ensembl variation resources. Reference data packaged for use in Ensembl VEP is available from our FTP site in release-specific directories, for example: http://ftp.ensembl.org/pub/release-103/variation/vep/.

WEB RESOURCES
The Ensembl VEP command-line tool is available from https://github.