Insights & Perspectives
MicroRNA annotation of plant genomes: Do it right or not at all



Introduction
MicroRNAs (miRNAs) are short noncoding trans-acting regulators of gene expression in animals and plants, among other eukaryote kingdoms [1], that have been demonstrated to be important regulators of protein coding genes [2]. They are expressed in a great diversity of tissues and developmental stages [3], are implicated as causal factors in macroevolutionary change [4], play fundamental roles in disease resistance [5] and stress tolerance, and are targeted for the bio-engineering of domesticated species [6]. Consequently, miRNA annotation has become an essential element of draft genome descriptions, particularly for plant species. Guidelines and recommendations for the accurate annotation of miRNAs have existed since the early days of miRNA research [7], and have been updated as technology and understanding have advanced. However, adherence to the established annotation practices varies from publication to publication and, unfortunately, while some of these are exemplars of best practice, the quality of miRNA annotation is generally very poor. While the problem of spurious miRNA annotation may be widely appreciated [8][9][10][11][12][13], plant genome annotation studies are contributing disproportionately to the corruption of public databases such as miRBase, serving to obscure rather than elucidate understanding of miRNA diversity, function and evolution. For example, though there are only 28,645 entries from across all of life in miRBase (v.21), 28,281 miRNA loci were recently described from soybean [14], and 98,068 from bread wheat [15]. It is extremely improbable that all but a small minority of these are valid.
Many of these problems of miRNA annotation are manifest in the draft genome of the orange, Citrus sinensis [22], where a large number of miRNAs were assigned to families identified elsewhere as exclusive to disparate lineages of plants and animals. For example, 38 novel families identified in C. sinensis are considered exclusive to the European honeybee, Apis mellifera, in miRBase. If these annotations are correct, the implications are stunning, indicating either a common origin of animal and plant miRNAs, or evidence of horizontal transfer between plant and pollinator. The miRNA repertoire of C. sinensis is surprising in other ways, too, including the apparent absence of a number of ancient and otherwise highly conserved miRNA families common to most plant species, and the presence of many miRNA families listed on miRBase that are known to be spurious.
In an attempt to highlight the procedural errors that plague miRNA annotations in many draft genome publications, we undertook a thorough reanalysis of the miRNAome of C. sinensis, with use of novel sequencing data.

sRNA sequencing is integral to miRNAome annotation
Although the criteria for the identification of miRNAs are long established [7,38], these have been updated over time to account for the depth of coverage of NGS data [39,40]. The key features of a genuine miRNA are: (i) the canonical "hairpin" structure from which the functional "mature" strand and miRNA* strand are processed, (ii) a homogeneous mode of processing by which few sequences other than the mature and miRNA* are processed, and (iii) the presence of the miRNA* strand indicating processing by Dicer. Many software packages exist to identify miRNAs according to these rules, but in order to balance completeness of annotation with accuracy, it is not uncommon for there to be a high frequency of false positives, necessitating manual inspection of each annotated miRNA to ensure strict adherence to annotation criteria.
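As a rough illustration of how the three criteria might be applied to a single locus, the following Python sketch classifies a locus from summary read counts. The `LocusEvidence` fields, the `classify_locus` function and the 0.75 precision threshold are assumptions made for this example; they are not taken from any of the software packages discussed, and such a filter would not replace manual inspection.

```python
from dataclasses import dataclass


@dataclass
class LocusEvidence:
    """Summary evidence for one candidate locus (all fields are hypothetical)."""
    valid_hairpin: bool  # criterion (i): precursor folds into an acceptable hairpin
    mature_reads: int    # reads matching the mature strand
    star_reads: int      # reads matching the miRNA* strand
    total_reads: int     # all sRNA reads mapped to the precursor


def classify_locus(ev: LocusEvidence, min_precision: float = 0.75) -> str:
    """Return 'miRNA', 'candidate' or 'rejected'.

    min_precision is an illustrative threshold, not a value from the text.
    """
    if not ev.valid_hairpin or ev.mature_reads == 0 or ev.total_reads == 0:
        return "rejected"
    # Criterion (ii): most reads should be the mature or miRNA* sequence.
    precision = (ev.mature_reads + ev.star_reads) / ev.total_reads
    if precision < min_precision:
        return "rejected"
    # Criterion (iii): without miRNA* reads the locus is not yet annotated.
    if ev.star_reads == 0:
        return "candidate"
    return "miRNA"
```

A locus that passes the structural and precision checks but has no miRNA* reads is reported here as a candidate rather than as an annotated miRNA; any locus labelled "miRNA" would still need to be checked by eye.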
Approaches for the identification of miRNAs can be categorised as those based on conservation with other species, and de novo identification based on NGS data. Both methods rely on the identification of structural features that indicate processing by the canonical miRNA biogenesis pathways, but only small RNA sequencing allows de novo identification. The central tenets of accurate annotation are presented in Fig. 1; additional methodological details are presented in the Supplementary information.
The majority of annotations in the C. sinensis miRNAome are invalid
Following this pipeline, analysis of the same sRNA data set used for the C. sinensis draft genome [22] identified only 98 of the 227 previously described miRNA loci as fulfilling all of the annotation criteria: 77 loci exhibited no evidence of miRNA* expression; 54 loci exhibited heterogeneous processing of multiple products; 24 loci were classified in the incorrect family; 22 loci were duplicates; 20 loci had no evidence for any sRNAs being processed; and 10 loci had invalid hairpin structures (Fig. 2; Tables S1 and S2). Ninety-nine of the miRNAs described were attributed names (MIR6001-MIR6070) that miRBase has allocated to a wide range of other taxa including Apis, Homo, Tribolium, Brassica, Nicotiana, and Solanum. For example, the miR-6001 entry on miRBase is a miRNA that has only previously been described from honeybee. Finally, the C. sinensis draft genome [22] failed to identify 17 miRNA loci that had been described previously by members of the same team [41].
Our resequencing of C. sinensis small RNAs, combined with the data used in the draft genome, supports the annotation of 111 of the 227 loci (Tables S3 and S4), with invalid hairpin structure or imprecise processing invalidating the miRNA annotations of 62 of the 227 loci. Thirty-two loci lack the evidence required for annotation, viz. expression of the miRNA* or, indeed, of the mature miRNA itself, and a further 22 loci are duplicated sequences. The bulk of these spurious loci are among the putative novel families, leaving a residue of just nine valid miRNAs from among the 99 novel loci proposed. De novo identification of miRNAs using miRDeep-P identified a further 11 miRNA loci and 21 additional valid loci were obtained from miRBase. In total, the known orange miRNA repertoire comprises 143 miRNAs attributable to 45 families, 10 of which are known only in C. sinensis (Fig. 2, Tables S5 and S6).
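The counts in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names and category labels are ours.

```python
# Counts quoted in the text for the 227 previously annotated loci.
previously_annotated = 227
supported = 111
rejected = {
    "invalid hairpin or imprecise processing": 62,
    "insufficient expression evidence": 32,
    "duplicated sequences": 22,
}

# Additional valid loci reported in the reanalysis.
de_novo = 11
from_mirbase = 21

# Supported and rejected loci should account for every original annotation.
total_rejected = sum(rejected.values())
assert supported + total_rejected == previously_annotated

# Final repertoire: supported loci plus the newly added valid loci.
repertoire = supported + de_novo + from_mirbase
print(total_rejected, repertoire)
```

On these figures, 116 of the 227 original loci are rejected and the final repertoire is 111 + 11 + 21 = 143 loci.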
Over half of the miRNAs annotated in the C. sinensis draft genome failed to satisfy the established criteria required for annotation, with many loci exhibiting evidence that directly contradicts their identification as miRNAs (Fig. 2). Based solely on the data used in the original study, even fewer of these loci should have been annotated as miRNAs (only 98 of the 227 loci). The reasons for the large number of false positives are many and varied, but followed generally from a lack of adherence to the established guidelines for miRNA annotation that require strong evidence of processing by DCL-1. A two-nucleotide offset between the mature and miRNA* strands provides strong evidence for DCL-1 processing. Without this evidence, the large number of unique reads and potential miRNA-like hairpin structures in the genome will, by chance, result in many reads in the 20-24 nt range being mapped to hairpin-like genome loci even though they are not processed by DCL-1. While both the 5′ and 3′ strands of miRNAs can be functional [42], it is common for one strand, the miRNA*, to degrade rapidly after dissociation from the mature strand, and so the lack of miRNA* reads may be explained by low levels of expression within the cell, rather than being evidence that the locus is not a miRNA. Consequently, the 25 loci in the orange genome that lack only this feature can be described as candidate miRNAs, but they should not be annotated as miRNA loci until evidence of miRNA* expression is presented. In contrast, the 57 loci that exhibit heterogeneous processing across the precursor sequence, and the 10 loci that show insufficient complementarity in their precursor structure or mature sequence to allow DCL-1 processing, are certainly not miRNAs.
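The two-nucleotide offset can be checked computationally once the precursor has been folded. The following Python sketch assumes a dot-bracket secondary structure (as produced by common RNA folding programs), 0-based inclusive coordinates for the two strands, and that the 5′ terminal nucleotide of each strand is base-paired. The function names are hypothetical and the handling of unpaired ends is deliberately simplistic.

```python
def pair_table(structure: str) -> dict:
    """Map each paired position to its partner for a dot-bracket string."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            pairs[i], pairs[j] = j, i
    return pairs


def three_prime_overhangs(structure, five_p, three_p):
    """Return the 3' overhang lengths of the (5p, 3p) strands.

    five_p and three_p are (start, end) tuples, 0-based and inclusive.
    Returns None if the 5' end of either strand is unpaired (an
    assumption of this sketch; real tools handle such cases explicitly).
    """
    pairs = pair_table(structure)
    a, b = five_p
    c, d = three_p
    if a not in pairs or c not in pairs:
        return None
    # Each 3' end is compared with the partner of the other strand's 5' end.
    return (b - pairs[c], d - pairs[a])


# Toy hairpin: a 25 bp stem closed by a 6 nt loop.
hairpin = "(" * 25 + "." * 6 + ")" * 25
print(three_prime_overhangs(hairpin, (2, 22), (35, 55)))
```

For this toy hairpin the two strands give overhangs of two nucleotides each, the signature expected of DCL-1 processing; a pair of reads giving any other offsets would not by itself support annotation.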
Finally, the exciting possibility of miRNAs shared between orange and honeybee was revealed to be an artefact of inconsistency and inaccuracy in the naming of miRNAs; 22 miRNAs were synonyms of other loci and 23 miRNAs were assigned to the wrong family. All phylogenetically predicted miRNA families [9] are present except MIR828 and MIR3630, which are absent from the genome.

Avoiding the inclusion of spurious miRNAs in genome annotation
There appear to be three primary failings that result in the poor quality of genome annotations. The first of these, and the most significant, is the unquestioning inclusion of all miRNA loci predicted by bioinformatic software, without manual verification of the veracity of these loci. There are many high-quality software packages available for the de novo prediction of miRNA loci, but in these there is always a trade-off between producing a complete description of the miRNAome and reducing the number of false positives. Three of the leading packages for miRNA identification are miRDeep-P [43], ShortStack [44], and the UEA sRNA workbench [45]. All of these are excellent and extremely useful tools, but in each case the associated publications acknowledge that the software will yield false positives. The only way to avoid the misannotation of such loci as miRNAs is to manually verify that every locus satisfies the established rules of miRNA annotation. Secondly, due to an accumulation of previous incorrect annotations, both in miRBase and the wider literature, any homology search must use a valid set of miRNAs to avoid the perpetuation of historical misannotations. Finally, the completeness of a miRNAome is rarely considered in relation to the suite of miRNAs that should be anticipated based on the phylogenetic position of a species. Hence, many miRNAome descriptions overlook the presence of ancient miRNA families. Ultimately, confidence in miRNAome annotation requires data from at least one small RNA library. Without such data, the validation of novel loci is not possible, as neither the precision of processing nor evidence of DCL-1 processing can be assessed.
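The second and third points lend themselves to a simple automated check before manual curation. The Python sketch below compares a set of annotated family names against the families expected from the phylogenetic position of the species, and against a list of families already known to be spurious. The function name and both input lists are assumptions for illustration: MIR156 and MIR166 stand in for deeply conserved plant families, and the "known spurious" entry is a placeholder, not a real miRBase record.

```python
def audit_families(annotated, expected, known_spurious):
    """Compare annotated miRNA family names with phylogenetic expectations.

    Returns the expected families that are missing from the annotation and
    the annotated families that appear on a list of known spurious entries.
    """
    annotated = set(annotated)
    return {
        "missing_expected": sorted(set(expected) - annotated),
        "flagged_spurious": sorted(annotated & set(known_spurious)),
    }


# Illustrative inputs only; a real audit would use curated family lists.
annotated = ["MIR156", "MIR166", "MIR_PLACEHOLDER_1"]
expected = ["MIR156", "MIR166", "MIR828", "MIR3630"]
known_spurious = ["MIR_PLACEHOLDER_1"]

report = audit_families(annotated, expected, known_spurious)
print(report)
```

An expected family that is reported as missing is a prompt to search the genome and sRNA data again, not proof of absence, and a flagged family should be removed from the homology reference set before any search is run.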
Is there a real and meaningful distinction between miRNAs and siRNAs?
The distinction between the tightly controlled processing of miRNA products and the heterogeneous processing of siRNAs is key to the correct identification of miRNA loci. The earliest miRNA families described were conserved across broad taxonomic groupings, exhibited extremely high levels of processing precision and, thus, were clearly distinct from siRNAs [46]. This distinction is often less clear among younger miRNAs, specific to young lineages, because of more heterogeneous processing [47,48]. It has also been hypothesised that miRNAs can evolve from siRNA loci, suggesting that "proto" miRNAs exist that are in the process of evolving but that do not yet exhibit an unambiguous signature of canonical miRNA biogenesis [49]. Hence, some researchers have claimed that the distinction between miRNAs and other siRNAs is false or arbitrary, perhaps serving more to obscure than inform the origin and function of the small RNA regulatory machinery [50].
There can be no doubt that both miRNAs and siRNAs are important functional classes of small RNAs and, indeed, there are examples in which miRNAs have evolved from and into siRNAs [51,52]. However, in the vast majority of instances, miRNAs and siRNAs are readily distinguishable based on the established annotation criteria, principally the consistency of their processing. In almost all miRNAs, consistency of processing of an evolutionarily conserved locus yields a distinct product that binds to a specific and equally conserved target sequence. siRNAs are not conserved in the same manner and the heterogeneous nature of their processing bears this out. These factors make the distinction between miRNAs and siRNAs useful in the sense that it reflects functional, regulatory, as well as evolutionary differences between these two classes of small RNAs. At the very least, knowledge of the function of specific miRNAs can be translated between species in which they are conserved, in a manner that is not possible among siRNAs, except in a much more general sense. Thus, for both theoretical and functional reasons, we advocate a continued distinction between these two classes of small regulatory RNAs.

Conclusions and outlook
The annotation of the orange miRNAome is particularly poor, but it is not otherwise unrepresentative of miRNA annotation in draft genomes. We have already highlighted the implausibly large repertoire of 98,068 miRNA loci in bread wheat [15] and 28,281 loci in soybean [14]. Similarly, yet less implausibly, 537 miRNA loci were recently described from mungbean [20], but over half are clearly spurious, assigned erroneously to miRNA families known only from animals. Even a cursory consideration of sequence homology, perhaps prompted by phylogenetic expectations of the mungbean miRNAome [9], could have avoided the annotation of yet more spurious miRNAs. Neither the annotation criteria nor the pipeline we describe for their implementation are novel; they are simply not being observed. Consequently, spurious miRNAs are being described at an increasingly alarming rate, diminishing prospects for understanding and exploiting the role of miRNAs in plant biology. This eventuality was foreseen by the original authors of the criteria for plant miRNA annotation, who predicted that without adherence to a strict set of annotation criteria, there would be widespread "annotation of miRNAs that have a high likelihood of being siRNAs; many sequences of such uncertain provenance are likely to be identified for a broad range of plant species" [38]. This prophecy has now come to pass, and the research community must react decisively to improve the accuracy of miRNAome annotations and, particularly, the generally poor quality annotations associated with draft genomes, which contribute more than their fair share to the corruption of miRNA repositories. Undoubtedly, this is at least in part a consequence of the scope of plant genome descriptions, which has expanded with knowledge of the universe of coding and non-coding genes and regulatory elements.
The same problem is not generally manifest in the annotation of animal genomes, as miRNAomes are more usually described in distinct publications where their justification may receive greater scrutiny. This may be the most practical solution to the problem of miRNA misannotation but, ultimately, we implore genome annotation teams that aim to tackle the miRNAome: do it right or not at all.