A standardised nomenclature for long non‐coding RNAs

Abstract The HUGO Gene Nomenclature Committee (HGNC) is the sole group with the authority to approve symbols for human genes, including long non‐coding RNA (lncRNA) genes. Use of approved symbols ensures that publications and biomedical databases are easily searchable and reduces the risks of confusion that can be caused by using the same symbol to refer to different genes or using many different symbols for the same gene. Here, we describe how the HGNC names lncRNA genes and review the nomenclature of the seven lncRNA genes most mentioned in the scientific literature.


| INTRODUCTION
The HUGO (Human Genome Organisation) Gene Nomenclature Committee (HGNC) is the only group with the official capacity to name human genes. We name protein-coding genes, pseudogenes and non-coding RNA (ncRNA) genes; our commentary on our latest nomenclature guidelines 1
The naming of lncRNA genes is currently the main focus of our ncRNA naming work, in part due to the large numbers of these genes annotated in the human genome, and in part due to the many papers being published on the lncRNAs encoded by these genes. LncRNA genes are the only class of human genes, other than protein-coding (pc) genes, where research groups may suggest a symbol based on a function or important characteristic of the gene. The HGNC encourages research groups to contact us prior to publication to ensure that proposed symbols meet with HGNC guidelines. 1 Briefly, new human gene symbols should not clash with existing vertebrate gene symbols, commonly used abbreviations, or common English words; symbols should contain only uppercase Latin letters and Arabic numerals; symbols should not contain references to any species; symbols must not be pejorative or offensive. The use of punctuation is avoided although hyphens may be used in specific cases. Unique symbols have always been important to aid literature searching but are now more necessary than ever with the advent of text mining. HGNC curators search the scientific literature for papers on lncRNA genes; where published symbols do not fulfil HGNC guidelines we contact authors to discuss suitable alternatives. For this reason, we approved the unique symbol CHROMR, "cholesterol induced regulator of metabolism RNA," for the lncRNA gene first published as CHROME 3 and EMSLR, "E2F1 mRNA stabilising lncRNA," for the lncRNA first published as EMS. 4 Both CHROME and EMS are poor search terms, and "chrome" is a widely used English word.
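
The character and uniqueness rules summarised above lend themselves to a simple automated pre-check. The following is a minimal sketch in Python, not an HGNC tool: the function name `check_symbol`, the regular expression and the small `POOR_SEARCH_TERMS` set are illustrative assumptions. It covers only the rules quoted in this section (uppercase Latin letters and Arabic numerals, optional hyphens, no clash with existing symbols or common terms) and cannot replace curator review.

```python
import re

# Character rule quoted in the text: uppercase Latin letters and Arabic
# numerals; hyphens are tolerated here only between alphanumeric parts.
# The exact pattern is an assumption made for illustration.
SYMBOL_PATTERN = re.compile(r"[A-Z0-9]+(?:-[A-Z0-9]+)*")

# Hypothetical stand-in for a curated list of common words and abbreviations;
# the two entries are the examples the text describes as poor search terms.
POOR_SEARCH_TERMS = {"CHROME", "EMS"}


def check_symbol(symbol, existing_symbols=frozenset()):
    """Return a list of problems found with a proposed gene symbol."""
    problems = []
    if not SYMBOL_PATTERN.fullmatch(symbol):
        problems.append("invalid characters")
    if symbol.upper() in POOR_SEARCH_TERMS:
        problems.append("poor search term")
    # Compare case-insensitively so that, e.g., "Xist" clashes with "XIST".
    if symbol.upper() in {s.upper() for s in existing_symbols}:
        problems.append("clashes with existing symbol")
    return problems


if __name__ == "__main__":
    for candidate in ("CHROMR", "CHROME", "Hotair"):
        print(candidate, check_symbol(candidate))
```

An empty list means only that none of these basic checks failed; rules such as avoiding species references or pejorative terms require human judgement.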
The HGNC has been naming lncRNA genes since the early 1990s but it is within the last decade that this endeavour has taken up a large proportion of our gene naming effort. HGNC approved lncRNA gene symbols are displayed in relevant biomedical resources such as Ensembl, 5 NCBI Gene, 6 RNAcentral, 7 LNCipedia, 8 OMIM 9 and GeneCards. 10 The HGNC provides a Symbol Report on our website (genenames.org) for each gene with an approved symbol that features links out to these and other relevant biomedical resources; Figure 1 shows an example Symbol Report for XIST. Where there is a mouse ortholog, we provide a link to the relevant page of the Mouse Gene Database. 11 Figure 2a demonstrates how rapidly the number of publications has increased with time for the seven most widely published lncRNA genes. We have provided a summary of the nomenclature of each of these seven lncRNAs below. These examples illustrate many of the typical issues we consider while naming genes.

| XIST
The XIST (X inactive specific transcript) gene was first published in 1991 12 and the symbol was approved by the HGNC in the same year. As of April 2022, there were over 1,900 hits in PubMed for the XIST symbol ( Figure 2a) with no other competing gene symbols in general use and no overlapping use of the abbreviation to refer to different concepts. XIST is conserved in eutherians and contains two exons derived from a pseudogene that has a coding ortholog from the LNX (ligand of numb-protein X) family, published as Lnx3, at a conserved position in birds, reptiles and amphibians. However, the majority of XIST exons contain sequence derived from mobile elements that is completely unrelated to the pseudogene. 13,14 XIST is necessary for inactivation of one X chromosome in cells with two copies of this chromosome; please see 15 for a recent review on the mechanisms by which XIST achieves this. Notably, the XIST sequence element known as "Repeat A" that has been shown to be necessary for gene silencing is not located within the pseudogene-derived sequence. 14

| H19
The H19 symbol was approved by the HGNC in April 1994 based on 16 who stated that "Despite the fact that it is transcribed by RNA polymerase II and is spliced and polyadenylated, we suggest that the H19 RNA is not a classical mRNA. Instead, the product of this unusual gene may be an RNA molecule." The H19 symbol is also approved for the mouse and rat orthologs; in all three species this lncRNA gene shows sequence similarity and hosts the microRNA gene MIR675 in an exon. The symbol H19 should be viewed as historical as it does not represent a characteristic or function of the gene; this is an example of a gene symbol that the HGNC will retain as it is supported by the lncRNA community and widely published ( Figure 2a). H19 originates from a paper on mouse fetal-specific hepatic mRNAs and the assumption is that the "H" stood for hepatic although this is not explicitly stated; this paper already commented that H19 is expressed in heart and skeletal muscle. 17 The original HGNC-approved gene name that accompanied the H19 symbol was "H19, imprinted maternally expressed untranslated mRNA" but this has since been updated to "H19 imprinted maternally expressed transcript" because the term mRNA is now used only for genes that produce transcripts which are translated into protein. H19 is expressed in the foetus and placenta; the current approved name reflects the fact that this imprinted gene is expressed from the maternal allele. This is in contrast with the neighbouring protein coding gene IGF2, which is also highly expressed in the placenta but is expressed from the paternal allele. 18 H19 is found in some adult tissues such as skeletal muscle and the adrenal gland, and its dysregulation has been associated with many types of cancer although there are contrasting theories about its involvement in the progression of these cancers. 19

| MEG3
MEG3 (maternally expressed gene 3) is another maternally imprinted lncRNA gene. This gene was originally approved with the symbol GTL2 (gene trap locus 2) based on the identification of the mouse ortholog from the site of a gene trap integration. 20 It was subsequently renamed to MEG3 to be grouped with other maternally imprinted genes using the MEG# root symbol 21 in mouse and human; MEG8 and MEG9 are approved symbols for other lncRNA genes. Like H19, MEG3 has been associated with many types of cancer and has been reported to be a tumour suppressor gene via regulation of TP53, 22 by separate regulation of RB1, 23 and by suppression of angiogenesis. 24 Figure 2b shows usage of GTL2 versus MEG3 over time and shows how MEG3 is now the symbol supported by the lncRNA community. The GTL2 symbol has been retained in the MEG3 entry as a "previous symbol" in line with HGNC's normal practice of retaining all previously approved gene symbols.

FIGURE 1 An example Symbol Report for the lncRNA gene XIST from genenames.org. HGNC Symbol Reports present the HGNC-approved gene symbol, gene name, unique HGNC ID and other manually curated data in the top HGNC data section. The "Stable symbol" luggage tag is shown at the top of the report for approved symbols which are unlikely to ever be changed. Further down the report, links to many different biomedical resources are provided. Here, we have highlighted the resources that are particularly relevant to lncRNAs.

| HOTAIR
HOTAIR (HOX transcript antisense RNA), which lies antisense to the protein coding HOXC11 gene, was approved in 2007 based on Reference 25. This lncRNA was initially reported to regulate genes at the HOXD locus. It has since been reported as positively regulating HOXC11 levels in cis and negatively regulating HOXD in trans, perhaps due to a duplicated noncoding element within the HOTAIR gene and HOXD locus. 26 This lncRNA has also been associated with many types of cancer. 27 HOTAIR has a mouse ortholog named Hotair, and mouse models have been reported with contrasting phenotypes. 28 We now have a more systematic way of reporting genes that are antisense to protein coding genes (see the "Systematic protocol" section below), and the symbol "HOTAIR" could be considered somewhat frivolous, a style that we avoid where possible, but we will retain the HOTAIR symbol due to overwhelming usage.

FIGURE 2 The number of publications in PubMed for the top seven most highly published lncRNA genes. (a) For each of the seven highly published lncRNA genes, the number of publications has rapidly increased over the last 5 years. (b) For all of the most highly published lncRNA genes, the majority of publications use the current HGNC approved symbol. The first chart shows how over time the number of publications supporting the approved symbol MEG3 has increased compared to the previous symbol GTL2. The second chart shows NEAT1 and its published aliases (VINC, MENbeta, MENepsilon, TncRNA); the usage of NEAT1 far surpasses any of its aliases within the last decade. The third chart compares usage of the approved symbol MALAT1 and its published alias NEAT2; again MALAT1 is highly supported. The other four most highly published lncRNA symbols have negligible numbers of publications that do not use the approved symbol.

| NEAT1
Two transcripts produced by the NEAT1 gene were first published as MENbeta and MENepsilon in a paper about the transcript map surrounding the MEN1 locus, 29 but these two transcripts were not further characterised at that time. The NEAT1 gene is over 620 kb downstream from the MEN1 gene, with many intervening protein coding genes between these two loci, and it has not been associated with the MEN1 gene functionally, so a symbol linking this gene to MEN1 is not optimal. A short transcript from the NEAT1 locus was described as "trophoblast noncoding RNA" (TncRNA) 30 but this isoform is not found in the mouse ortholog (Neat1) and "TncRNA" is not unique as it is also used as an abbreviation for both "telomeric ncRNA" and "tiny ncRNA" so would not be a suitable gene symbol. Additionally, the longer isoforms of NEAT1 are widely expressed so nomenclature linking this gene specifically to the trophoblast would be misleading. The symbol NEAT1 was first used in a study that identified large noncoding RNAs displaying nuclear enrichment. 31 The name accompanying the symbol was "nuclear enriched abundant transcript 1," which has been recorded as a gene name alias by the HGNC. The HGNC was contacted in 2009 by a researcher writing a review on this gene who requested that NEAT1 be approved for the human gene and Neat1 for the mouse ortholog. The HGNC coordinates with the Mouse Genomic Nomenclature Committee wherever possible to approve equivalent nomenclature for mouse and human orthologs. At that time the human and mouse transcripts had been shown to be necessary for the formation of paraspeckles in the nucleus, 32 and therefore the HGNC agreed upon a name that reflected this function and that could be approved alongside the NEAT1 symbol: "nuclear paraspeckle assembly transcript 1." NEAT1 also has the alias VINC (virus inducible non-coding RNA) based on its detection in mouse brains infected with Japanese encephalitis virus or Rabies virus. 33 As can be seen from Figure 2b, the NEAT1 symbol is overwhelmingly supported by the research community over any of its aliases.

| MALAT1
MALAT1 (metastasis associated lung adenocarcinoma transcript 1) was first identified in a study to find differences in gene expression between tumours of non-small cell lung cancer that metastasised and those that did not. 34 MALAT1 is located close to NEAT1 in the genomes of both human and mouse and is highly expressed in both species. MALAT1 is localised to nuclear speckles and hence has been given the alias NEAT2, 31 but unlike NEAT1 it is not required for assembly of paraspeckles. The NEAT2 alias is far less published than MALAT1 (Figure 2b). The MALAT1 locus also produces a small cytoplasmic tRNA-like transcript, known as mascRNA (MALAT1-associated small cytoplasmic RNA), via tRNA processing ribonucleases. 35 Although not restricted to lung cancers, overexpression of MALAT1 has been associated with metastasis in several different types of cancer, 36 though a smaller number of studies have reported that the lncRNA has a tumour suppressor role in some cancers. As the MALAT1 symbol is very well supported, the HGNC has no plans to change this symbol, but we would consider updating the accompanying descriptive gene name in the future to something more informative, if there is community support to do so.

| PVT1
The PVT1 symbol was first used for the mouse ortholog (Pvt1) following its discovery as the major locus for murine plasmacytoma variant translocations. 37 The human ortholog was subsequently found in Burkitt's lymphoma translocations. 38 The HGNC originally approved the gene name "pvt-1 (murine) oncogene homolog" as the descriptive name accompanying the approved PVT1 symbol, but we have since updated this to the simpler name "Pvt1 oncogene," which reflects how this gene is described in many papers. The HGNC no longer references other species in gene names to reduce possible confusion. Studies have reported that the PVT1 promoter regulates the MYC gene, and that presence of the PVT1 transcript is not necessary for this function. 39 The PVT1 gene hosts several microRNA genes and has widely been reported to be able to compete for binding of microRNAs. 40 Because it is a microRNA host locus, it also has the alias symbol MIR1204HG based on the most 5′ miRNA gene in the locus. The PVT1 symbol is highly published and is unique to this gene.

| MORE RECENT EXAMPLES OF lncRNA SYMBOLS APPROVED BASED ON PUBLICATIONS
We hope that many of our more recently approved lncRNA gene symbols will achieve the same level of support as the above symbols in the scientific literature in the future. Recent examples of approved lncRNA gene symbols that reflect the function of the encoded lncRNA include RENO1 for "regulator of early neurogenesis 1," 41 COSMOC for "cell fate and sterol metabolism associated divergent transcript of MOCOS" 42 and CPMER for "cytoplasmic mesoderm regulator." 43 All of these symbols were agreed with the HGNC prior to publication. We were able to approve the symbol NXTAR post publication 44 but we updated the gene name, with the agreement of the authors, from the published name "next to androgen receptor" to the more functionally informative name "negative expression of androgen receptor regulating lncRNA," which still fits with the NXTAR symbol.

| THE HGNC "STABLE" TAG
As outlined in the HGNC guidelines, 1 we are now committed to keeping the symbols of clinically relevant genes as stable as possible, and minimising changes to well-published gene symbols. In the era of clinical genomics, it is impossible to contact all clinicians, patient groups, charities and interested individuals to inform them of symbol changes, so it is important that the symbols of genes referred to in the clinic are kept as stable as possible. HGNC curators are currently working through a list of clinically relevant genes and adding a "stable" tag onto the Symbol Reports for these genes once curators are satisfied that the approved symbols are appropriate and are unlikely to be changed (see the top of the XIST Symbol Report shown in Figure 1). We have added this tag to over 40 non-coding RNA genes to date, including the two clinically relevant lncRNA genes, MIR17HG and PCA3. MIR17HG has been associated with Feingold syndrome type 2 as shown in the GenCC (Gene Curation Coalition, 45 ) database, while there is now a clinical test that evaluates levels of PCA3 RNA to help assess prostate cancer risk. 46 We have also added the stable tag to the seven highly published lncRNA genes described above, as we have no plans to change these symbols.

| SYSTEMATIC PROTOCOL FOR NAMING ANNOTATED HUMAN lncRNA GENES
In addition to approving lncRNA symbols based on published data, the HGNC has a systematic protocol for naming lncRNA genes that have been manually annotated by the RefSeq annotators at the National Center for Biotechnology Information (NCBI) 6 and/or the GENCODE annotators at Ensembl. 5 Note that the HGNC has a large set of unnamed lncRNA genes to work through; we currently prioritise genes that are mentioned in publications but have no suitable information for a non-systematic symbol, and lncRNA genes that have been annotated by both of the above-mentioned manual annotation projects. The eight categories, along with the non-systematic category based on published data described above, used for this systematic naming are shown in Figure 3. Please also see the decision-making chart published as fig. 1 in Reference 1 and a more detailed description of each lncRNA naming category in Reference 2.
The eight systematic categories of lncRNA genes are as follows:
• if an lncRNA gene hosts a microRNA gene in an exon or intron it is named as a microRNA non-coding host gene with the symbol format [microRNA symbol]HG, for example, MIR7-3HG
• if an lncRNA gene hosts a small nucleolar (sno)RNA gene it is named as a small nucleolar RNA non-coding host gene with the root symbol SNHG, for example, SNHG3
• if an lncRNA gene is intergenic with respect to protein-coding genes it is named as a long intergenic non-protein coding RNA with the root symbol LINC followed by a unique five digit number, for example, LINC02998
• if an lncRNA gene overlaps the genomic span of a pc gene but is located on the opposite strand compared to that pc gene it is named as an antisense RNA with the symbol format [pc symbol]-AS suffixed with a unique number, for example, ABCA9-AS1
• if an lncRNA gene overlaps at least one exon of a pc gene on the same strand, it is named as an overlapping transcript with the symbol format [pc symbol]-OT suffixed with a unique number, for example, PCBP2-OT1
• if an lncRNA is contained within an intron of a pc gene it is named as an intronic transcript with the symbol format [pc symbol]-IT suffixed with a unique number, for example, HAO2-IT1
• if an lncRNA gene shares a bidirectional promoter with a pc gene it is named as a divergent transcript with the symbol format [pc symbol]-DT, for example, CIBAR1-DT
• if an lncRNA gene has another lncRNA paralog in the human genome, these paralogs may be named with the FAM root symbol (family with sequence similarity), for example, FAM182A and FAM182B. Note that the FAM root symbol is also used for pc genes, but these can be distinguished via locus type.
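
The eight symbol formats listed above are regular enough to be recognised by pattern matching, which can be useful when sorting gene lists. The sketch below is illustrative only: the function name `classify_symbol` and the regular expressions are assumptions derived from the formats and examples given in this section, not an HGNC specification. A symbol alone cannot establish locus type; for instance, the FAM root is shared with protein-coding genes.

```python
import re

# (category name, pattern) pairs in the order the categories are listed above.
# The patterns are assumptions inferred from the stated symbol formats.
SYSTEMATIC_PATTERNS = [
    ("microRNA non-coding host gene", re.compile(r"MIR[A-Z0-9-]+HG")),
    ("small nucleolar RNA non-coding host gene", re.compile(r"SNHG\d+")),
    ("long intergenic non-protein coding RNA", re.compile(r"LINC\d{5}")),
    ("antisense RNA", re.compile(r"[A-Z0-9-]+-AS\d+")),
    ("overlapping transcript", re.compile(r"[A-Z0-9-]+-OT\d+")),
    ("intronic transcript", re.compile(r"[A-Z0-9-]+-IT\d+")),
    ("divergent transcript", re.compile(r"[A-Z0-9-]+-DT")),
    # The FAM root is also used for protein-coding genes, so this label
    # describes the symbol format only, not the locus type.
    ("family with sequence similarity", re.compile(r"FAM\d+[A-Z]?")),
]


def classify_symbol(symbol):
    """Return the systematic category whose format the symbol matches."""
    for category, pattern in SYSTEMATIC_PATTERNS:
        if pattern.fullmatch(symbol):
            return category
    return "non-systematic or unrecognised"


if __name__ == "__main__":
    for s in ("MIR7-3HG", "LINC02998", "ABCA9-AS1", "CIBAR1-DT", "XIST"):
        print(s, "->", classify_symbol(s))
```

Symbols assigned on the basis of published data, such as XIST or HOTAIR, match none of these formats and fall through to the final label.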
Although the above protocol is applied where no other suitable information is available at the time of naming, these symbols can become well-established in the literature and so may not necessarily be updated when further data are published, unless there is agreement between research groups working on the genes to do so. Where there is an ortholog in other species, the HGNC may pursue a rename in order that the orthologs be approved with the same symbol and name. For example, the human lncRNA gene DUBR (DPPA2 upstream binding RNA) had the previous symbol LINC00883, while mouse gene Dubr had the previous symbol 5330426P16Rik.

| A CAUTIONARY NOTE ON THE IMPORTANCE OF APPROVED GENE NOMENCLATURE
During our literature searches for papers on lncRNA genes, HGNC curators have noticed that many papers continue to use names based on BAC clones in the human genome assembly, which were used in previous versions of the Ensembl website as symbols, or primary identifiers, for lncRNAs. These clone-based identifiers used to be displayed on Ensembl gene reports for human genes that had no HGNC symbol, but have now been removed completely and are not searchable in the current version of the Ensembl website. We recently found

| PROTEIN CODING GENES THAT WERE PREVIOUSLY ANNOTATED AS lncRNA GENES
It may be surprising to consider that most lncRNA genes contain open reading frames (ORFs) but these are usually short in length, unsupported by conservation in other species, lack structural features such as protein domains, and are not supported by peptides from mass spectrometry. Post annotational experimental evidence may show that such ORFs are translated and therefore the locus types of lncRNA genes may be updated to protein coding.
The following genes were updated based on published data: MTLN (mitoregulin) 51 has the previous symbol LINC00116; GREP1 (glycine rich extracellular protein 1) 52 was previously LINC00514; NBDY (negative regulator of P-body association) 53 was previously LINC01420.
Although the HGNC will usually rename such genes, particularly if a new symbol is proposed by authors, in some cases we may retain the gene symbol and only update the gene name. This is the case for the gene TINCR as this is a well-published symbol that has been retained, while the locus is now annotated as protein coding. The gene name is now "TINCR ubiquitin domain containing" in place of the previous gene name "tissue differentiation-inducing non-protein coding RNA." The TINCR symbol is also still used in papers discussing the protein. 54,55 Note that there are still many recent papers describing TINCR as an lncRNA; it is possible that this gene has both coding and non-coding isoforms but this is true for many protein-coding genes and merits discussion. The HGNC does not approve separate symbols for non-coding isoforms of protein-coding genes, for example, ECRAR (endogenous cardiac regeneration-associated regulator) 56 is listed as an alias of the protein coding PTTG1 gene because ECRAR represents a noncoding variant.
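
To make the point about short ORFs at the start of this section concrete, the following sketch scans the three forward reading frames of a transcript sequence and reports the longest ATG-to-stop ORF. It is a toy illustration, not the procedure used by annotation groups, which, as noted above, also weigh conservation, protein domains and mass spectrometry evidence. The function name `longest_orf` and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return the longest ORF (ATG through stop codon) on the forward strand.

    Only the three forward frames are scanned; an empty string is returned
    when no complete ORF is present.
    """
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


if __name__ == "__main__":
    toy_transcript = "CCATGAAATAGCC"  # hypothetical sequence
    orf = longest_orf(toy_transcript)
    print(orf, "codons including stop:", len(orf) // 3)
```

An ORF found this way shows only that translation is formally possible; the experimental evidence described in this section is what supports a change of locus type.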

| GROUPING TRANSCRIPTS TOGETHER AS lncRNA GENES
For protein-coding genes the presence of ORFs provides information to gene annotators on when a set of overlapping transcripts should be grouped into the same gene or split into different genes. There is no equivalent information for lncRNA genes, which means that criteria need to be agreed upon between different annotation groups as to when transcripts should be grouped together as an lncRNA gene and when they should not. The HGNC plans to host a workshop on this subject with annotation groups and selected lncRNA researchers to decide upon guidelines for this issue. We hope that this will result in consistent grouping of transcripts into lncRNA gene models in the future.

| CONCLUSION
The field of lncRNA research continues to grow rapidly each year. Consistent use of approved gene symbols for lncRNA genes will mean that all research papers and associated online resources are easily searchable for lncRNAs. We encourage researchers publishing on new lncRNA genes to contact the HGNC prior to submission. This will enable HGNC curators to check that the proposed symbol follows our guidelines and will prevent changes to gene symbols post publication. HGNC-approved symbols appear on our website, www.genenames.org, and in many key lncRNA resources.

ACKNOWLEDGEMENTS
We thank all members of the HGNC for their helpful discussions on the naming of lncRNA genes and particularly the HGNC alumnus, Dr Matt Wright, for all of his hard work on lncRNAs. The HGNC is funded by Wellcome Trust grant 208349/Z/17/Z and the National Human Genome Research Institute (NHGRI) grant U24HG003345. All authors have read and approved the final manuscript. The contents of this paper are solely the responsibility of the authors and do not necessarily represent the official views of the National Institutes of Health.

CONFLICT OF INTEREST
The authors have no conflicts of interest to report.