Several phased siRNA annotation methods can frequently misidentify 24 nucleotide siRNA‐dominated PHAS loci

Abstract Small RNAs regulate key physiological functions in land plants. Small RNAs can be divided into two categories: microRNAs (miRNAs) and short interfering RNAs (siRNAs); siRNAs are further subdivided into transposon/repetitive region‐localized heterochromatic siRNAs and phased siRNAs (phasiRNAs). PhasiRNAs are produced from the miRNA‐mediated cleavage of a Pol II RNA transcript; the miRNA cleavage site provides a defined starting point from which phasiRNAs are produced in a distinctly phased pattern. 21–22 nucleotide (nt)‐dominated phasiRNA‐producing loci (PHAS) are well represented in all land plants to date. In contrast, 24 nt‐dominated PHAS loci are known to be encoded only in monocots and are generally restricted to male reproductive tissues. Currently, only one miRNA (miR2275) is known to trigger the production of these 24 nt‐dominated PHAS loci. In this study, we use stringent methodologies in order to examine whether or not 24 nt‐dominated PHAS loci also exist in Arabidopsis thaliana. We find that highly expressed heterochromatic siRNAs were consistently misidentified as 24 nt‐dominated PHAS loci using multiple PHAS‐detecting algorithms. We also find that MIR2275 is not found in A. thaliana, and it seems to have been lost in the last common ancestor of Brassicales. Altogether, our research highlights the potential issues with widely used PHAS‐detecting algorithms which may lead to false positives when trying to annotate new PHAS, especially 24 nt‐dominated loci.

(2) The main criticism for the lack of clarity of the abstract (for a general audience) can be addressed by introducing the "phasing problem" gradually; the start of the abstract is perfect, after introducing the miRNAs and siRNAs continue along the same lines. E.g. "imprecisely processed" is too vague for readers not familiar with the field; next the role of miR 2275 is not clear -first I would like to see few words on the role of miRNAs in the production of tas loci, and then it would become clearer why the focus is on miR 2275 > A sentence that explains the role of miRNAs in phasiRNA biogenesis has been added to the abstract. (lines 14-16).
(3) I like the conservation analysis of miRNA 2275, however some elements of the analysis should be addressed a. The phylo tree should be to scale and the branches should be used as proxy for speciation distances. Also the multiple sequence alignment would be very informative for the readers and would also reveal the location of mutations (along the mature miR/miR* or the stem and provide a discussion point on the weighted approach that emphasises more the high selection pressure on the mature fragments in contrast to the rest of the stem) > We have done this; it is now included as Figure S2. (~ line 726).
b. Rows 136-140: it is not entirely clear whether you were looking for known miRNAs to trigger the 24-nt dominated locus. > The sentence states "...we predicted whether or not any known A. thaliana miRNAs could target these three loci", so as we stated it is using known miRNAs. (now at lines 145-146).
To make this search exhaustive I suggest to look for putative new miRNAs using a function first approach i.e. identify the putative sequence of a targeting miRNA, then match the read on the genome looking for suitable loci with a hairpin-like secondary structure. The fragment may or may not be present in sequencing libraries -depending on the sequencing bias-but its presence can always be tested using a wet-lab validation approach e.g. northern blot > We thank the reviewer for the suggestion but we have elected to not pursue this analysis for two reasons: 1) In the end, we conclude that these loci are not truly phased anyway; that's the major conclusion of the whole study. 2) While we acknowledge there could be biases in sRNA-seq libraries that would obfuscate discovery of certain small RNAs, the approach of reverse-searching for siRNAs or miRNAs in absence of any sRNA-seq data has a dismal history of false positives in the previous literature.  (4) General comment: you are using the terms locus and cluster interchangeablyplease choose one term and be consistent throughout. I prefer the term locus since this reflects the biological function of a location in the genome. I know that the term cluster has been used for some time, however it is misleading in the current machine-learning environment and be avoided, unless it refers to the identification of patterns in an unsupervised manner.
> Thank you, we agree and have changed to the term locus / loci throughout the revised manuscript.
(5) Row 376 avoid using blast for drawing conclusions on sRNAs -the algorithm was designed to work on longer sequences. Other tools such as bowtie or patman are more appropriate for this task.
> We have re-done the search using bowtie (allowing up to two mismatches) and have found largely the same results: We lost two hits (Musa accuminata (Banana) and Brachypodium stacei) but found two others (Brachypodium distachyon and Malus domestica). The revision uses the bowtie method.
(6) Row 402: "200 nts" the secondary structures depend on the input (the quality and length), a search of the best possible secondary structures using incremental windows is more suitable (and will provide a clearer answer). In addition, I recommend testing some locus identification algorithms like segmentseq (Hardcastle et al 2010) or colide (Mohorianu et al 2013) and use the pattern characterisation from these tools to refine your results. > Thank you for the suggestion. We did not pursue segmentseq or colide because this particular analysis did not involve analysis of genome-aligned sRNA-seq data. Instead, this analysis was merely predicting RNA secondary structures from genomic regions around bowtie hits to miR2275. Using mFold, we have manually investigated subsections of the arbitrary +/-200 nt genomic windows and confirmed that the predicted hairpins we report are those with the minimum deltaG. We find that the mFold-predicted secondary structures are quite robust regardless of how much or how little flanking sequence is included in the secondary structure predictions.
(7) For the plots of secondary structures ... first include the name of the miRNA being plotted, second try to either provide a justification for weird structures (e.g. C. sinensis) or exclude them.
> Fig S1 has been modified to include microRNA names. We don't believe there are any predicted secondary structures that are too "weird" although of course this is a judgment call. (line 712-717).
(8) Row 732 -for the radial plots, what do the numbers on the y-axis indicate? Are these abundances (linear or logarithmic scale)? If the abundance scale is linear, provide some justification that these are not in the noise range. Fig S6 are percentages, plotted on a linear scale. This is noted in the figure legend for Fig S6 (lines 759-763). As we describe in the main text (lines 150-155), the first three plots (our 3 'false positive' loci) indeed do seem to be just noise .. no strongly predominating phase register and little reproducibility between different libraries. TAS2 is included as a positive control.

> The numbers on the radial plots of
(9) Row 751 : the presence plots seem to be artificially elongated; I would like to see a comparison with other methods or a justification for the length of these loci.
> We agree that the boundaries of our siRNA loci are set by automated detection and that alternative methods of locus discovery might set the locus boundaries differently. The exact method of locus-finding used by our ShortStack script are now described explicitly in the main text: "All distinct genomic intervals containing one or more primary sRNA-seq alignments within 75 nts of each other were obtained, and then filtered to remove loci where the total sRNA-seq abundance with a locus was less than 0.5 reads per million." (lines 444-447). We agree that an in-depth comparison with other small RNA locus-finding methods would be interesting, but we contend that it is quite a bit beyond the scope of this study: Implementing multiple locus-finding techniques amounts to re-doing every other downstream analysis for our entire study. We have added text in several places of the revision discussing the fact that locus-boundary settings may also influence false-discoveries of phased siRNA loci, especially because our 'false' ones were very long (Fig 3C).

Minor comments:
(1) row 45: the phrase "single-stranded, stem and loop" should be replaced wit "single stranded RNA with a hairpin like secondary structure" > Done. (line 47).
(2) row 51: some words are missing "biogenesis of hc-siRNAs begins with the transcription of TEs"; also hc-siRNAs are not restricted to TEs, they can also derive/target from promoters and other siRNA loci along the genome > Agreed, rephrased. (lines 51-54) (3) row 61: a citation is necessary at the end of the sentence.

> Done. (lines 68-70).
(4) Row 201: the lengths of these loci are worrying and I would like to see a full description of reads properties and a justification that these loci are above the noise level.
> Please see response to your Major Comment #9 above. In regard to this figure  (Figure 3c), whatever one thinks about the method by which the boundaries of the loci were determined, it is the same method applied consistently, so the comparison across the three classes of loci seems valid. Our 'false positive' loci are clearly quite long relative to the bulk of 24 nt siRNA loci and we've added that to the discussion. (lines 350-353).

> Done (refers to
(6) Row 256: the rationale is not entirely clear -do clarify it.

> Done (lines 270-272).
(7) Row 276: as remarked earlier the comparisons presented in subplot (a) are not meaningful. You could try a subsampling approach to answer the question: "how often would we get a similar approach when 8/4 loci are selected at random"; however, given the low number of loci, I doubt that the subsampling would be stable in itself. Figure 5. We've re-done the analysis with the suggested approach, reflected in the new Figure 5 (~ line 288). Qualitatively, the results are much the same: No obvious biases for gene/TE overlaps or for multi-mapping reads, but clear bias toward long clusters with abundant siRNAs.

> Refers to
(8) Row 846: this is only a remark -it is very useful to see the information on the sequencing adapter. Due to the strong effect of sequencing bias, I would not combine libraries built with different technologies -each sequencing adapter reveals part of the sRNA population; the sequencing libraries are not wrong, just incomplete. Table S2. We agree that adaptergenerated sequence bias is real. We have not combined libraries made by different methods for differential expression analyses. All of the differential expression analysis was between wild-type and mutant libraries made by the same lab with the same methods; there weren't cross-adapter comparisons made with respect to differential expression.