TagSeq for gene expression in non‐model plants: A pilot study at the Santa Rita Experimental Range NEON core site

Premise TagSeq is a cost‐effective approach for gene expression studies requiring a large number of samples. To date, TagSeq studies in plants have been limited to those with a high‐quality reference genome. We tested the suitability of reference transcriptomes for TagSeq in non‐model plants, as part of a study of natural gene expression variation at the Santa Rita Experimental Range National Ecological Observatory Network (NEON) core site. Methods Tissue for TagSeq was sampled from multiple individuals of four species (Bouteloua aristidoides and Eragrostis lehmanniana [Poaceae], Tidestromia lanuginosa [Amaranthaceae], and Parkinsonia florida [Fabaceae]) at two locations on three dates (56 samples total). One sample per species was used to create a reference transcriptome via standard RNA‐seq. TagSeq performance was assessed by recovery of reference loci, specificity of tag alignments, and variation among samples. Results A high fraction of tags aligned to each reference and mapped uniquely. Expression patterns were quantifiable for tens of thousands of loci, which revealed consistent spatial differentiation in expression for all species. Discussion TagSeq using de novo reference transcriptomes was an effective approach to quantifying gene expression in this study. Tags were highly locus specific and generated biologically informative profiles for four non‐model plant species.

Gene expression studies that involve sampling many individuals or tissues can be powerful for identifying variation in transcriptional activity and function (e.g., among populations, over time, in response to the environment or other treatments; Gould et al., 2018;Mead et al., 2019), as well as the structure of transcriptional networks and the genetic basis of gene expression variation (Wisecaver et al., 2017;. Such studies are of high interest for non-model species responding to natural environments, as well as for model species (Matz, 2018;Zaidem et al., 2019). Quantitative next-generation sequencing of expressed genes, known as RNAseq, has made expression studies broadly accessible for non-model species, but remains expensive per sample and difficult to scale up for questions that require high numbers of replicates (Lohman et al., 2016).
A cost-effective approach is to target only a small region of each transcript for sequencing, identifying and quantifying its expression while avoiding sequencing across its full length. Several versions of this approach have involved reading a short tag of sequence upstream of the poly(A) tail of mRNA. These methods have their roots in one of the first approaches to RNA sequencing prior to the next-generation sequencing era, expressed sequence tags or ESTs (Parkinson and Blaxter, 2009). Meyer et al. (2011) developed an updated version for next-generation applications, which continues to be used and adapted (e.g., Dixon et al., 2018;Kremling et al., 2018;Mitchell et al., 2019;Pallares et al., 2020). Recently, Lohman et al. (2016) published further developments of what has become known as TagSeq (also TAGseq or Tag-seq) and compared its performance with standard RNA-seq of full transcripts. Notably, they find that TagSeq achieves higher accuracy than standard RNA-seq, presumably because sequencing effort is distributed more evenly to all transcripts when only the tag sequence is targeted.
TagSeq tags are short sequences, however, and must be aligned to a reference to fully identify the loci that are being expressed (Meyer et al., 2011). For non-model species and multi-species studies, high-quality reference genomes are not likely to be available. Instead, assembled reference transcriptomes can be generated using standard RNA-seq (Matz, 2018). Reference transcriptomes will differ from genomes in that not all loci will be represented by transcripts present in a given sample, not all transcripts will be assembled to full length, and the assembly will vary in the degree to which splice variants, alleles, and paralogs will occur as unique sequences or be merged (Meyer et al., 2011;Yang and Smith, 2013;Carpenter et al., 2019;Patterson et al., 2019). These issues will reduce the number of TagSeq reads that can be uniquely mapped to the reference, relative to a full genome, and they may be particularly problematic in plants where gene and genome duplications are common (Barker et al., 2016;One Thousand Plant Transcriptomes Initiative, 2019;Li and Barker, 2020), although sequencing at the variable 3′ untranslated region (UTR) should maximize locus discrimination (Rise et al., 2004). Meyer at al. (2011) originally demonstrated the TagSeq method in a non-model species of coral, where tags were aligned to a reference transcriptome. Many subsequent studies have successfully used a similar approach in other non-model animals (e.g., Kenkel and Matz, 2016;Dixon et al., 2018;Kriefall et al., 2018;Rocker et al., 2019). In plants, however, TagSeq studies to date appear to have been confined to model species for which a high-quality reference genome is available (Meyer et al., 2014;Des Marais et al., 2015;Lovell et al., 2016;Kremling et al., 2018;Chu et al., 2019;Razzaque et al., 2019;Weng et al., 2019). How TagSeq will perform using a reference transcriptome in plants is not clear given the lack of such studies and a paucity of relevant performance information for TagSeq.
Here we report a pilot study using TagSeq to quantify gene expression for four plant species, as part of a study of gene expression variation at the Santa Rita Experimental Range and National Ecological Observatory Network (NEON) core site (Green Valley, Arizona, USA). We assembled a reference transcriptome for each species using standard RNA-seq and analyzed gene expression using TagSeq across multiple individuals for each species, sampled at two locations and three time points. We evaluated the fraction of tags that map uniquely to loci in the reference transcriptome, and the specificity of mapping against references from the same sample, from another sample of the same species, and from other species. We further evaluated the performance of TagSeq in terms of the number of reference loci observed as a function of TagSeq sequencing effort, and the variation in TagSeq profiles across species, sites, and times. Our goal was to assess whether TagSeq is a locus-specific and biologically informative approach for non-model species lacking a high-quality reference genome.

Sampling
Our pilot study focused on four commonly occurring species at the Santa Rita Experimental Range Long Term Research and NEON core site (SRER; Fig. 1, Appendix 1). These include the native species Tidestromia lanuginosa (Nutt.) Standl. (Amaranthaceae; woolly tidestromia), Parkinsonia florida (Benth. ex A. Gray) S. Watson (Fabaceae; blue palo verde), and Bouteloua aristidoides (Kunth) Griseb. (Poaceae; needle grama), as well as the introduced species Eragrostis lehmanniana Nees (Poaceae; Lehmann lovegrass; native to southern Africa). All species were identified using a combination of the historical flora of the SRER (Medina, 2003), the Arizona Flora (Kearney et al., 1960), and the Flora of North America (Flora of North America Editorial Committee, 1993). Based on chromosome counts of these and congeners in the Chromosome Counts Database (Rice et al., 2015), we infer that P. florida and T. lanuginosa are likely diploid species, while the grasses B. aristidoides and E. lehmanniana are both likely to be tetraploids. Vouchers were deposited in the University of Arizona herbarium (ARIZ; Fig. 1, Appendix 1).
Tissue from mature plants was collected from an apparently healthy individual representing each target species weekly on three dates (September 1, 7, and 13) during the 2017 growing season. An entire stem was sampled for B. aristidoides (with flowers and fruits) and E. lehmanniana (without flowers or fruits). Only leaves and leaflets were sampled for P. florida and T. lanuginosa. At each sampling date, 2-4 individuals were sampled from each species at each of two locations ("Phone Pole" and "Grassland"; 4-6 samples total/species/date; Fig. 1, Appendix 1). For P. florida, samples at the same location and date were not from different individuals, but instead were multiple collections of tissue from the same individual (replicates). The same individual was also resampled at each time point for P. florida, and individuals from the same population were sampled for B. aristidoides, E. lehmanniana, and T. lanuginosa. Samples were collected in the same order on each day beginning at the Phone Pole location and as close as possible to the same time of day (afternoon). Leaf tissues were flash frozen in liquid nitrogen in the field and transported to the University of Arizona for RNA extraction. Total RNA was extracted from leaf tissue using the Spectrum Plant Total RNA Kit (Sigma-Aldrich, St. Louis, Missouri, USA) following the manufacturer's Protocol A.
Several aspects of reference assembly quality were assessed. Summary statistics including the number of contig scaffolds, scaffold lengths, and N50 were calculated by TransRate version 1.0.3 (Smith-Unna et al., 2016). The completeness of transcriptome coverage was quantified using BUSCO version 4.0.5 (Seppey et al., 2019), which identifies representation of a collection of universal single-copy orthologs for the Viridiplantae (Viridiplantae Odb10) and the eukaryotes (Eukaryote Odb10). Finally, reference contigs matching known proteins were identified using TransPipe (Barker et al., 2010), in which contigs were compared to protein sequences from 25 sequenced and annotated plant genomes from Phytozome (Goodstein et al., 2012) using BLASTX (Wheeler et al., 2008). Besthit proteins were paired with each gene at a minimum cutoff of 30% sequence similarity over at least 150 sites. To determine the reading frame and generate estimated amino acid sequences, each gene was aligned against its best-hit protein by GeneWise 2.2.2 (Birney et al., 2004). Based on the highest-scoring GeneWise DNA-protein alignments, stop and 'N'-containing codons were removed to produce estimated amino acid sequences for each gene (archived at https:// doi.org/10.5281/zenodo.3740232; Marx et al., 2020b).

TagSeq gene expression
TagSeq libraries for all samples were prepared and sequenced at the University of Arizona Genomics core center. Total RNA was used to prepare TagSeq libraries according to the detailed protocol given in Lohman et al. (2016), with the DNase I step included (QIAGEN #79254; QIAGEN, Valencia, California, USA). RNA was fragmented using NEBNext RNA fragmentation buffer (New England Biolabs, Ipswich, Massachusetts, USA), cleaned using RNAClean XP beads (Beckman Coulter), and quantified using RNA PicoGreen (Life Technologies, Carlsbad, California, USA). cDNA was synthesized using forward primers with four degenerate bases near the 3′ end (Eurofins Scientific, Luxembourg City, Luxembourg) for the identification of PCR duplicates, and then PCR amplified for 15 cycles, incorporating sample-specific barcodes. PCR products were purified using AMPure XP beads (Beckman Coulter), and a Pippen Prep electrophoresis system (Sage Science, Beverly, Massachusetts, USA) was used for 400-500-bp size selection. DNA was quantified using DNA PicoGreen (Life Technologies) and pooled in equal representation. The final library was quantified using the KAPA SYBR FAST ABI Prism qPCR Kit (Roche). A total of 56 samples (Appendix 1) were sequenced together on one lane of the NextSeq 500 platform (1 × 75 bp High-Output Kit; Illumina). All primer sequences were unmodified from those given in Lohman et al. (2016).
Tag sequences were cleaned of several potential contaminants before analysis. PCR duplicates were identified as sequences that were identical over the first 57 bases, which included the four-base degenerate primer region, three-base GGG RNA priming region, and 50 additional bases of unique sequence (using the script 'removeP-CRdups57'; Marx et al., 2020b). The program 'cutadapt' version 1.9.1 (Martin, 2011) was used to trim the 5′ degenerate primer region, 3′ poly(A) tails (eight or more bases), 3′ low-quality bases (minimum score 20), and primer/adapter contaminants with a minimum overlap of 8 bp. Reads less than 57 bases after trimming were discarded. The remaining reads were considered unique sequence tags.
To quantify expression of each locus, tags were aligned to the reference transcriptomes using BWA-mem version 0.7.17 (Li and Durbin, 2010) with a bandwidth of 5 bp (-w 5; because gaps relative to the transcriptome reference are not expected in these tag sequences). All other parameters were set at the default value. The number of hits to a reference sequence (expressional level) was tallied using HtSeq-count version 0.5.4 (Anders et al., 2015), with --stranded=no (the reference assembly is not stranded). A GTF file was generated from the transcriptome assembly for use with HtSeqcount (using the script 'create_GTF.pl'; Marx et al., 2020b). Hits to each locus were combined across samples and filtered for loci with a minimum of five hits across each species' data set to reduce erroneous hits due to sequencing errors (using the script 'combine_HtSeq. pl'; Marx et al., 2020b).
We evaluated the performance of our TagSeq data in terms of recovery of reference loci, specificity of tag alignments, and variation in expression patterns among samples. To assess the ability of TagSeq to track loci in a reference transcriptome, we plotted the proportion of the reference sequences to which tags aligned as a function of TagSeq sequencing effort (total reads) and fitted a logarithmic curve to identify patterns of saturation with sequencing effort. To examine the specificity of the tags, we quantified the fraction of tags that mapped to multiple reference loci. We also compared the number of tags aligning to references when (i) the reference and TagSeq were derived from the same sample, (ii) the reference and tags were derived from different samples of the same species (individuals or populations), and (iii) the reference and tags were derived from different species.
Finally, we assessed expression differences among samples with MDS ordination of all TagSeq samples for a species. R/vegan version 2.4-3 (Oksanen et al., 2016) was used to calculate the relative abundance matrix across loci and samples, and R/limma version 3.26.9 (Ritchie et al., 2015) was used to calculate root-mean-square deviation (Euclidean distance) among samples and construct the ordination. Distances were based on the loci with the largest standard deviation among all samples (gene.selection = "common"). The number of top loci used was determined by the median value of loci observed among samples: 60,000 for B. aristidoides, 54,000 for E. lehmanniana, 52,000 for P. florida, and 110,000 for T. lanuginosa.

RESULTS
Raw data for RNA-seq references and TagSeq gene expression were deposited at the National Center for Biotechnology Information (NCBI) Sequence Read Archive (BioProject #PRJNA599443). Reference sequencing included 64-83 million raw reads and 63-81 million clean reads, per species (Table 1). Assembly metrics indicated that the most complete assembly was obtained for P. florida, with 78% BUSCO recovery, N50 of 895 bp, and the largest fraction of contigs translating to known proteins (Table 1). The two grass species yielded the least comprehensive reference assemblies, with 45% and 49% BUSCO scores for E. lehmanniana and B. aristidoides, respectively, and N50 values below 600 bp for both species. Assembly metrics were generally intermediate for T. lanuginosa, although it had the largest number of assembled contigs and contigs translating to proteins. Notably, despite having the second largest sequencing effort, E. lehmanniana had the smallest maximum contig size, lowest BUSCO score, and fewest contigs matching known proteins, suggesting that contigs assembled more poorly for this species relative to the others.
TagSeq libraries included a range of 2.6 million to 9.9 million raw reads per sample, except for two samples with low read counts: E. lehmanniana Sample 1 with 335,000 reads, and T. lanuginosa Sample 9 with 1.2 million reads (Appendix 2). Read cleaning resulted in a low proportion of reads removed due to quality issues (typically <10%). In contrast, PCR duplicates accounted for 42-61% of reads (for all samples other than Sample 1, for which PCR duplicates were 71% of reads).
Among the remaining unique tags, >80% of tags aligned to reference sequences for most samples, other than those of E. lehmanniana. For E. lehmanniana, 56-65% of tags aligned to the reference sequences (Appendix 2). The fraction of tags aligning to more than one reference was low across all samples (<1%), including those of the two tetraploid grass species. The fraction of the RNA-seq reference sequences that were observed in TagSeq samples ranged from 10-24% (excluding Samples 1 and 9, which had low read counts), resulting in 33-45% of references observed across all samples together. Requiring that a tag be observed at least five times reduced the fraction of references observed by approximately half for each species.
The number of reference sequences observed among tag sequences was related to the level of TagSeq sequencing effort (Fig.  2, Appendix 2). All species showed trends toward increases in the proportion of reference loci recovered with increasing sequencing effort, although all trends appeared to be saturating and additional sequencing was predicted to result in only modest increases in references observed. TagSeq samples that were the same as the RNA-seq reference sample did not have disproportionately high matches to the reference sequence for their sequencing effort (Fig. 2); however, tag alignments to the reference sequence were highly species specific (Table 2). Between the two grass species (B. aristidoides and E. lehmanniana), 15-18% of tags aligned to the reference of the other species. For all other combinations of species, 7% or fewer tags aligned to a heterospecific reference.
Ordinations for each species revealed clear variation in gene expression among samples (Fig. 3). All species showed clear separation between samples from different locations (closed vs. open symbols, Fig. 3). For E. lehmanniana, Sample 39 was strongly differentiated from all other samples along Axis 1 (Fig. 3B inset), and excluding this sample from the distance matrix allowed further resolution of variation among the remaining samples (Fig. 3B). Samples from different dates within a location had a weaker tendency to separate (different symbol shapes; Fig. 3), such that samples from the same location and date did not always cluster together.

DISCUSSION
We evaluated the performance of TagSeq for surveys of gene expression in non-model plant species, using repeated sampling of four species and alignment of tags to de novo assemblies of RNAseq reference transcriptomes. We found that a high fraction of tags TABLE 1. RNA-seq reference assembly summary statistics for each species. Included are the numbers of raw reads, clean reads, assembled contigs, and contigs aligning to proteins (translating), as well as the N50 and maximum contig length (bp) and the percentage of BUSCO sequences matching contigs (complete and partial) in the Viridiplantae database.

Species
Raw aligned to each reference, and few tags mapped to multiple loci or to transcriptomes of other species. Samples from different locations showed clearly differentiated expression profiles for all four species, and these patterns were robust to sampling across three dates. Our results support the TagSeq approach as an effective means of generating specific and informative expression profiles in non-model plants.
Quality filtering of tags resulted in very low losses of data (<11% of sequences for all but the most poorly sequenced sample in our data set), but PCR duplicates comprised 42-60% of samples. PCR duplicates are commonly abundant in Illumina library preparation methods (Aird et al., 2011), andLohman et al. (2016) reported PCR duplicates of >70% for their test of the protocol used here. The large fraction of sequences involved in PCR duplicates emphasizes the importance of utilizing degenerate bases for identification and removal of duplicates when quantifying expression, as well as the importance of minimizing PCR cycles to maximize sequencing effort on sequences of interest.
For E. lehmanniana, >60% of filtered tags mapped to reference loci for most samples, and for nearly all remaining samples of the other species >80% of tags mapped to the reference. For the tags that did not map, at least three factors could explain their failure to align and the variation in alignment rates among species. First, reference loci must include the sequence at the 3′ end of the transcript, immediately upstream of the poly(A) tail, where TagSeq reads will be located. RNA-seq read distribution is random along the transcript, and therefore many loci will fail to include the required region by chance, and the fraction of loci lacking this region will vary among samples and with the sequencing effort used in creating the reference (Meyer et al., 2011;Conesa et al., 2016;Matz, 2018). Indeed, E. lehmanniana in particular showed evidence of having the least well-assembled transcriptome among our references. Second, Lohman et al. (2016) found that TagSeq was more sensitive to low levels of expression than was RNA-seq. This difference in sensitivity could result in novel low-expression tags in the TagSeq data set, for which there is no representative locus in the RNA-seq reference. Finally, allelic differences between samples could cause tags to fail to align to a reference sequence from another individual, although in our data set we did not see lower rates of alignment in samples that were different than that used for the RNA-seq reference libraries.
For tags lacking a reference sequence, it would be possible in principle to cluster similar tags and to score their expression levels. We observed very low rates (<1%) of mapping to multiple reference loci, which suggests that clustering methods should be able to group tags into inferred loci without high rates of merging across different true loci. Without a reference sequence, however, no information would be available about the identity and function of those loci, which is typically the goal of expression studies (Conesa et al., 2016). Other references (e.g., annotated whole genomes of related species) could be explored for tag identification, but our analyses found that alignment rates to heterospecific loci were low (<20% within the same family, <10% between families).
From the perspective of the RNA-seq reference library, a large fraction of reference loci (typically >80%) were not observed in individual TagSeq samples. Again, the samples used for both RNA-seq and TagSeq did not recover a greater number of reference loci, suggesting that neither sequence differences between reference sequences and tags nor differences in genes expressed among samples explained the failure to observe a large number of reference loci in the tags. Additional TagSeq sequencing effort did not result in large gains in the observation of reference loci, although the combination of all samples roughly doubled the fraction of loci observed relative to any one sample, suggesting that tag sequencing effort within the range of our study will affect the number of loci observed. As described above, missing sequence information at the 3′ end of reference loci will also have a large influence on alignment rates, and will set an upper limit on the FIGURE 2. Proportion of RNA-seq reference loci with aligned tags, as a function of sequencing effort (raw read number) of TagSeq libraries. Logarithmic best fits are shown for each species: Bouteloua aristidoides (BOAR, circles), Eragrostis lehmanniana (ERLE, squares), Parkinsonia florida (PAFL, upward triangles), and Tidestromia lanuginosa (TILA, downward triangles). Samples that were used for both a TagSeq library and the RNA-seq reference are indicated by dark filled symbols. Two additional replicates of the reference P. florida individual collected on the same day are indicated by lightly shaded symbols. Reference loci were required to be observed in a minimum of five tags across a data set to be counted. fraction of loci that can be observed. In their initial publication of the next-generation tag sequencing method, Meyer et al. (2011) also report that >80% of reference transcriptome sequences were poorly represented in their tag sequencing, and they suggest that this may be due to sequencing errors in the reference data set. Only 3.7-8.5% of our reference loci aligned to known proteins, and the number of loci translating to proteins was much more consistent with numbers of genes known from well-studied genomes (Marx et al., 2020a), suggesting a large number of erroneous loci in our references. These issues regarding reference transcriptome quality could also explain differences in the maximum fraction of loci recovered among the different species. Finally, we used ordinations to explore whether our resulting TagSeq expression data showed evidence of biologically relevant structure among samples, which would be amenable to further differential expression analyses. Our analyses revealed distinct separation of expression profiles between samples taken from different collection locations within each species. Spatial samples separated into non-overlapping groups along the first (major) axis of ordinations for P. florida and T. lanuginosa, and through a combination of both axes for B. aristidoides. Spatial samples for E. lehmanniana converged for a few samples along axis 1. Temporal samples also appeared to group together within spatial locations for some combinations of dates, sites, and species, but additional sampling would be required to resolve temporal patterns robustly. Only one sample (Sample 39 for E. lehmanniana) across all species was an outlier in ordination space, such that it clustered far from the other samples and obscured variation in the remaining data set until it was removed.
In summary, we found that TagSeq expression profiles were biologically informative and showed little evidence of problems with tag specificity against non-model transcriptome reference data sets. A large proportion of reference loci were not represented in the TagSeq data set, however, suggesting that completeness of reference assemblies (i.e., assembly of the 3′ end) is likely to influence the identification of loci being expressed. Nevertheless, TagSeq quantified the expression of tens of thousands of loci for each species and revealed important patterns of differentiation among samples in our data set, suggesting that this is likely to be a fruitful approach for high-throughput gene expression studies in non-model plants.

DATA AVAILABILITY
Raw sequence data for RNA-seq references and TagSeq gene expression have been deposited at the National Center for Biotechnology Information (NCBI) Sequence Read Archive under BioProject #PRJNA599443. RNA-seq assemblies, translations, and all custom scripts are available at https://doi.org/10.5281/zenodo.3740232 (Marx et al., 2020b).