Pilot RNA‐seq data from 24 species of vascular plants at Harvard Forest

Premise Large‐scale projects such as the National Ecological Observatory Network (NEON) collect ecological data on entire biomes to track climate change. NEON provides an opportunity to launch community transcriptomic projects that ask integrative questions in ecology and evolution. We conducted a pilot study to investigate the challenges of collecting RNA‐seq data from diverse plant communities. Methods We generated >650 Gbp of RNA‐seq for 24 vascular plant species representing 12 genera and nine families at the Harvard Forest NEON site. Each species was sampled twice in 2016 (July and August). We assessed transcriptome quality and content with TransRate, BUSCO, and Gene Ontology annotations. Results Only modest differences in assembly quality were observed across multiple k‐mers. On average, transcriptomes contained hits to >70% of loci in the BUSCO database. We found no significant difference in the number of assembled and annotated transcripts between diploid and polyploid transcriptomes. Discussion We provide new RNA‐seq data sets for 24 species of vascular plants in Harvard Forest. Challenges associated with this type of study included recovery of high‐quality RNA from diverse species and access to NEON sites for genomic sampling. Overcoming these challenges offers opportunities for large‐scale studies at the intersection of ecology and genomics.

Many questions in ecology and evolutionary biology increasingly require combining data from these fields at large scales. In particular, integrated, large-scale analyses of multispecies ecological and phylogenetic data sets have become critical to understanding plant distributions and responses to climate change (Zanne et al., 2014;Swenson and Jones, 2017;Maitner et al., 2018;Enquist et al., 2019;McFadden et al., 2019;Rice et al., 2019;Baniaga et al., 2020;Gallagher et al., 2020;Román-Palacios and Wiens, 2020). Recognizing this need, the National Science Foundation (NSF) recently launched the National Ecological Observatory Network (NEON) to generate large-scale data in areas including species occurrence, phenology, and climate, for ecological communities across the United States (Collinge, 2018;Knapp and Collins, 2019). Metagenomic and genomic sampling are also being used to identify and estimate changes in abundance and composition of some taxa, especially microbial communities (https://www.neons cience.org/ data). Although these data and analyses will be crucial for understanding ecosystem-scale processes, the collection of genomic data from a broader array of species across NEON sites would allow researchers to further integrate ecological and evolutionary processes in the analyses of communities.
Genomic analyses of single species, although important, do not capture the larger patterns occurring within an interacting community of plants. Transcriptome profiling or genome sequencing of multiple species and individuals within a community will open new, integrative avenues of analysis and allow us to address existing questions that require sampling of floras and communities (Bragg et al., 2015;Fitzpatrick and Keller, 2015;Bowsher et al., 2017;Han et al., 2017;Swenson and Jones, 2017;Zambrano et al., 2017;Matthews et al., 2018;Subrahmaniam et al., 2018;Breed et al., 2019). This is especially true for understanding responses to climate change where community-level analyses are needed to capture the interacting dynamics of different species responses (Liu et al., 2018;Komatsu et al., 2019;Snell et al., 2019). The integration of community-level genomic data from non-model species with ecological and trait data will improve our understanding of plant responses to climate change. Collecting genomic data at the community level with repeated sampling that mirrors other trait data collection will permit assessments of the genetic diversity of entire plant communities and how they change over time, estimates of gene flow and hybridization, measurement of in situ gene expression variation across species in response to shared climate events, and a genomic perspective on functional diversity within and between plant communities. Metagenomics analyses of microbiomes have transformed our understanding of and approaches for studying microbial biology (Fierer et al., 2012a, b;Turner et al., 2013;Delgado-Baquerizo et al., 2018;Jansson and Hofmockel, 2020). Similar plant community transcriptomics and genomics studies could open new avenues of research and provide the crucial data to understand plant responses to climate change.
To explore the potential and challenges of plant community transcriptomics, we conducted a pilot RNA-seq study at the Harvard Forest NEON site. Whereas many RNA-seq studies are focused on collections of related species, an approach that simplifies collection and RNA extraction, a major challenge of communitylevel transcriptomics is that a diverse range of plant species need to be sampled for RNA extraction in the field. In this pilot study, we evaluated RNA-seq results generated following a protocol that we developed (Field Setup 2 of Yang et al., 2017) for collecting material at distant field sites and returning samples by shipping. Harvard Forest was selected for this pilot study because of access to a field station that simplified the logistics of working with liquid nitrogen. At Harvard Forest, we sampled 24 species of vascular plants from sites adjacent to the NEON plot. Each species was sampled on two different dates one month apart (in July and August 2016), as close to the same time of day as possible. Species were selected from a phylogenetically diverse range of plants that included ferns, trees, and herbaceous annuals. These plants were selected because they represented the diversity of form and habit that is present in the deciduous forest community at Harvard Forest. Another potential challenge for plant transcriptomics is the abundance of polyploid species and cytotypes (Barker et al., 2016a). With potentially twice as many (or more) genes in a polyploid genome, these species could require more sequencing reads than related diploids to obtain reference transcriptomes of similar quality. To explore the impacts of polyploidy on transcriptome surveys, we made an effort to select sets of related polyploid and diploid species. Here, we give an overview of our data collection, present new reference transcriptomes and translated protein collections for each species, and evaluate the quality of these assemblies using multiple approaches.

Taxon selection and sampling
The Harvard Forest Flora (Jenkins et al., 2008) was used to guide our taxonomic selections and find species to represent each category (diploid/polyploid). Putative diploids and neo-polyploid species were identified from chromosome counts obtained from the Chromosome Counts Database (Rice et al., 2015). Congeneric species pairs were selected based on their phylogenetic relatedness. Our sampling included nine polyploid and 11 diploid species (Table  1). We could not determine the ploidal level of four species. The Harvard Forest Flora Database (Jenkins et al., 2008) was used to locate sampling sites.
Field collection for plant RNA-seq followed the approach described in Yang et al. (2017). The only difference was here we sampled tissue from mature leaves of an apparently healthy individual (e.g., lacking herbivore or pathogen damage) rather than young flower or leaf buds to maintain developmental consistency as much as possible over time. Each target species was sampled from the same population on two different dates about one month apart (July and August) during the 2016 growing season (Fig. 1). We attempted to sample as close to the same time of day as possible on both dates by sampling species in the same order on both trips, but this was not always achievable due to challenges of fieldwork, such as weather and time to relocate sample populations. Leaf tissues were flash-frozen in liquid nitrogen in the field and shipped on dry ice to the University of Arizona for RNA extraction. After leaf tissue collection, additional leaf tissue was preserved on silica for DNA backup, and the remaining plant material was pressed for a herbarium specimen (see Appendix 1 for voucher information and collection details).

RNA extraction and RNA-seq
Total RNA was extracted from leaf tissue collected on each sampling date for all species using the Spectrum Plant Total RNA Kit (Sigma-Aldrich Co., St. Louis, Missouri, USA) following the manufacturer's Protocol A. RNA was used to prepare cDNA using the Ovation RNA-Seq System (catalogue no. 7102-A01; NuGEN, Redwood, California, USA) via single primer isothermal amplification and automated on the Apollo 324 liquid handler (TaKaRa Bio, Kusatsu, Shiga, Japan). cDNA was quantified on the NanoDrop (Thermo Fisher Scientific, Waltham, Massachusetts, USA) and was sheared to approximately 300-bp fragments using the Covaris M220 ultrasonicator (Covaris, Woburn, Massachusetts, USA). Libraries were generated using Kapa Biosystem's library preparation kit for Illumina (KK8201; Roche, Wilmington, Massachusetts, USA). Fragments were end-repaired and A-tailed, and individual indexes and adapters (catalogue no. 520999; Bioo Scientific, Austin, Texas, USA) were ligated on each separate sample. The adapter-ligated molecules were cleaned using AMPure beads (A63883; Agencourt Bioscience/ Beckman Coulter, Indianapolis, Indiana, USA) and amplified with the KAPA HiFi enzyme (KK2502; Roche). Each library was then analyzed for fragment size on an Agilent TapeStation 4200 (Agilent Technologies, Santa Clara, California, USA) and quantified by qPCR (KAPA Library Quantification Kit, KK4835) on the QuantStudio 5 Real-Time PCR System (Thermo Fisher Scientific). Multiplex pooling (13-16 samples per lane) and paired-end sequencing at 2 × 150 bp were then performed on the Illumina NextSeq500 platform at Arizona State University's CLAS Genomics Core facility. Raw read quality was assessed using FastQC (Andrews, 2010).

De novo transcriptome assembly, protein translation, and quality assessment
Raw sequence reads were processed using the SnoWhite pipeline Dlugosch et al., 2013), which included trimming adapter sequences and bases with a quality score below 20 from the 3′ ends of all reads, removing reads that are entirely primer and/or adapter fragments using TagDust (Lassmann et al., 2009), and removing poly(A/T) tails with SeqClean (https://sourc eforge. net/proje cts/seqcl ean/). The cleaned reads from each sampling date were merged and cleaned to synchronize read pairs using fastq-pair  (Edwards and Edwards, 2019) and pooled to assemble a reference de novo transcriptome for each species. Due to the significant time involved in running and evaluating multiple assemblies for each species, we chose five species that represent the phylogenetic diversity of our samples (Dryopteris intermedia (Muhl. ex Willd.) A. Gray, Galium mollugo L., Juglans cinerea L., Plantago major L., and Persicaria sagittata (L.) H. Gross) to identify the optimal k-mer to use for assembling all 24 species. For these five exemplar taxa, we examined the quality of assemblies generated by SOAPdenovo-Trans version 1.03 (Xie et al., 2014) across a range of k-mers (37, 47, 57, 67, 77, 87, 97, 107, 117, and 127). Assembly quality across the different k-mers was assessed by mapping the raw reads to each assembly with TransRate version 1.0.3 (Smith-Unna et al., 2016) and evaluating the optimal assembly scores. TransRate calculates assembly scores by remapping the reads back to the assembly and combining a variety of metrics for each contig, including estimates of whether a base pair was called correctly, whether a base should be a part of the final transcript, the probability that a contig was derived from a single transcript, and the probability that a contig is structurally complete. We selected a k-mer that produced the average highest optimal assembly score across the five species. This k-mer (57, see Results) was used to assemble reference transcriptomes for the entire collection of species.
We used TransPipe  to identify plant proteins from the assembled transcripts for each reference transcriptome and provide protein and in-frame nucleic acid sequences for each species. The reading frame and protein translation for each sequence was identified by comparison to protein sequences from 25 sequenced and annotated plant genomes from Phytozome (Goodstein et al., 2012). Using BLASTX (Wheeler et al., 2008), best-hit proteins were paired with each transcript at a minimum cutoff of 30% sequence similarity over at least 150 sites. Transcripts that did not have a best-hit protein at this level were removed. To determine the reading frame and generate estimated amino acid sequences, each transcript was aligned against its best-hit protein by GeneWise 2.2.2 (Birney et al., 2004). Based on the highest-scoring GeneWise DNA-protein alignments, stop and "N"-containing codons were removed to produce estimated amino acid sequences for each transcript. Output included translated protein sequences and their corresponding nucleic acid sequences.
To assess the quality of the assembled transcriptomes for the full set of 24 species, we analyzed each with TransRate and BUSCO. Summary statistics, including the number of scaffolds, mean scaffold lengths, and N50, were calculated by TransRate version 1.0.3 for all scaffolds as well as for the subset of sequences that were identified as plant proteins and translated. We evaluated the completeness of our transcriptome coverage with BUSCO version 4.0.5 (Seppey et al., 2019). BUSCO compares sequences to a collection of universal single-copy orthologs for the Viridiplantae (Viridiplantae Odb10) and the eukaryotes (Eukaryote Odb10). We also used the TransRate and BUSCO statistics to compare differences in the assemblies of diploid and polyploid species.

Gene Ontology annotation and comparison
Gene Ontology (GO) annotations of all transcriptomes were obtained through translated BLAST (BLASTX) searches against the annotated Arabidopsis thaliana (L.) Heynh. protein database from TAIR (Lamesch et al., 2012) to find the best hit with a length of at least 100 bp and an E-value of at least 1e-10. GO-slim annotations based on the plant GO-slims from TAIR were obtained for the whole transcriptome for each species and presented as a heatmap. The heatmap columns were clustered by hierarchical clustering with default parameters in R with the order of GO categories set arbitrarily by the ranking in Lysimachia ciliata L. Rankings of the GO slim categories were determined by the relative frequency of the GO term among the transcripts in each transcriptome.

RESULTS
We found relatively little variation in the optimal TransRate scores across assemblies with different k-mers. The optimal TransRate scores ranged from ~0.1-0.15, with each of the five exemplar species peaking at different k-mers (Fig. 2). Scores trended downward for all species at higher k-mers, with no sharp peaks in the score apparent in most taxa. The mean k-mer of the top-scoring assemblies for each species was 61, and the closest k-mer to this value (57) was used to assemble reference transcriptomes for all 24 species. Assemblies for most of the 24 species appeared to be of relatively high quality. By combining RNA-seq libraries representing two different time points, we obtained an average of 27 Gbp of reads for each reference transcriptome (Table 1). With a k-mer = 57, the assemblies contained an average of 483,084 scaffolds with a mean length of 281 bp and N50 of 960 bp. The translated nucleic acids for each assembly had an average of 31,470 sequences with a mean length of 652 bp and N50 of 789 bp. We observed no significant relationship between the number of scaffolds or number of translated proteins and sequencing depth (Fig. 3). The mean complete plus fragmented BUSCO percentages were 73.2% against the Viridiplantae database and 76% against the eukaryote database (Table 2). We found that the number of hits to sequences in the BUSCO databases plateaued at around 20 Gbp of sequencing effort and around 20,000 proteins (Fig. 4).
Polyploid species did not have significantly more translated proteins than diploid species, with 31,152 average proteins translated compared to 30,804 ( Fig. 5A; two-tailed t-test: P = 0.95). Similarly, polyploid species did not have a significantly higher proportion of duplicated BUSCO matches than diploids ( Fig. 5B; two-tailed ttest: P = 0.11). In some cases, the number of proteins or duplicated BUSCO proportion was lower when comparing a polyploid species with its related diploids (e.g., Dryopteris Adans.). This may be due to variation in read and/or assembly quality rather than differences in the biology of these species. However, it is not clear that this is due to differences in data quality because in most cases, including Dryopteris, all of the species have similarly high read depth (>20 Gbp).
GO annotations of the transcriptomes of the 24 species were largely similar (Fig. 6). Categories such as other cellular processes, other metabolic processes, and other intracellular components were the largest fraction of all transcriptomes, whereas receptor binding or activity and electron transport or energy pathways were among the smallest. The rank order of each GO-slim category was largely consistent across most species. Species from the same genus were sometimes clustered together by the similarity of their GO-slim representations, such as in Dryopteris and Lysimachia L., but in most cases the species were not clustered with their congeners. Polygonum cilinode Michx. was unique in having many differences in GO category rank compared to the other taxa. It was also the lowest-scoring transcriptome assembly, with only 6088 translated proteins and nearly 80% of BUSCO genes missing (Table 2).

DISCUSSION
Overall, the RNA sampling approach we developed and employed (Field Setup 2 of Yang et al., 2017) allowed us to sequence and assemble RNA-seq data from a diverse range of species at Harvard Forest. The transcriptomes we assembled for 24 species of vascular plants at Harvard Forest appear to be relatively high quality and consistent with our expectations for de novo plant transcriptome assemblies. Our assemblies were reasonably complete, with more than 70% of BUSCO genes present on average. This is a similar distribution of BUSCO scores to those in the recently published 1KP project One Thousand Plant Transcriptomes Initiative, 2019) and other studies (Blande et al., 2017;Evkaikina et al., 2017;Pokorn et al., 2017;Weisberg et al., 2017). In our analyses, BUSCO scores and scaffold numbers appear to plateau after approximately 20 Gbp of sequencing effort for diploid and polyploid species, but previous studies indicate that reference assemblies of similar quality can be generated with substantially less sequencing effort for high-quality RNA samples. For example, data from the 1KP project suggest that as little as 2-3 Gbp of read depth is sufficient . Larger amounts of data were collected in this project to facilitate future gene expression analyses. Notably, the few samples that had low BUSCO scores or BUSCO scores that were relatively low for the sequencing effort, such as Polygonum cilinode, also had lower numbers of translated proteins but more scaffolds than most species. The relatively poor quality of these outlier assemblies is likely related to lower RNA quality rather than to sequencing effort or ploidal level. In contrast, assemblies with higher BUSCO scores yield translated protein numbers that are more consistent with the number of genes in sequenced plant genomes (Michael, 2014;Wendel et al., 2016).  (Shirasawa et al., 2017). However, these comparisons should be interpreted cautiously because transcriptome assemblies can contain multiple isoforms of protein-coding genes. Like many transcriptome assemblies (Johnson et al., 2012;Carpenter et al., 2019;Patterson et al., 2019), our assemblies also contain a large number of small scaffolds (<300 bp). Small scaffolds are likely artifacts of library amplification and sequencing, considering that most did not translate to a known plant protein sequence. We found no significant difference in the number of translated transcripts between the diploid and polyploid transcriptome assemblies. Although this could be due to the modest sample size or variation in the age and fractionation level of polyploids, it may also reflect biological differences in expressed transcriptome size and diversity that impact the number of assembled transcripts. Under a simple null model of polyploid transcriptome size, one may expect to observe an approximate doubling of the diploid transcriptome size that may translate to doubling the number of assembled transcripts. However, recent research indicates polyploid transcriptomes may be smaller than expected. Research in Glycine Willd. has found that the expressed transcriptome size of polyploid species is less than 2× the diploid size (Coate and Doyle, 2015;Doyle and Coate, 2018;Visger et al., 2019). For example, the transcriptome of the allotetraploid G. dolichocarpa Tateishi & H. Ohashi was 1.4× the size of its diploid progenitors (Coate and Doyle, 2010). The apparent lower-than-expected level of the  quantity of gene expression in polyploids may be an artifact of comparing diploids and polyploids without accounting for differences in cell numbers or biomass (Coate and Doyle, 2015;Doyle and Coate, 2018;Visger et al., 2019). However, smaller transcriptome sizes in polyploids may also be related to which genes are expressed at a given time or in a particular tissue. This is likely relevant when comparing the assembled gene space for diploid and polyploid transcriptomes, as we do here. Our non-model reference transcriptomes are built from the expressed genes in each sample rather than being based on a reference genome collection. Thus, only genes and alleles that are expressed will be captured in our assemblies and observed in our comparisons. Not all genes or alleles in a polyploid need to be expressed at one time and the overall diversity of the transcriptome at any given time may look more like a diploid, with other alleles being expressed at different times or tissues. Indeed, differential homoeolog silencing is well characterized in polyploid plants (Adams et al., 2003;Coate and Doyle, 2010) and may reduce the sampled transcript diversity of a polyploid genome. If this is the case, we would expect that sampling across more tissues, development times, and environments would lead to greater sampling of the polyploid gene space. Although RNA spike-ins and cell counting may improve differential expression analyses (Visger et al., 2019), capturing the full genome diversity of non-model polyploid species from RNA-seq assemblies remains an additional challenge. Our pilot study of RNA-seq sampling of diverse species in the field demonstrated some familiar challenges. Building on our past experience with extracting RNA from diverse species (Barker et al., 2008;Dempewolf et al., 2010;Der et al., 2011;Lai et al., 2012;Dlugosch et al., 2013;Hodgins et al., 2014;Barker et al., 2016b;Mandáková et al., 2017;Qi et al., 2017;Yang et al., 2017;An et al., 2019;Carpenter et al., 2019), we developed an approach for this study to obtain high-quality RNA from field samples (Field Setup 2 of Yang et al., 2017). We found that flash-freezing leaves in liquid nitrogen in situ for later RNA extraction worked well for our diverse samples. A few samples, especially Polygonum cilinode, yielded lower-quality RNA, which could potentially be related to leaf age at the time of sampling. Different RNA extraction methods will be needed to deal with the secondary compounds (e.g., polyphenolics) that are present in mature and senescing tissues. Recovering highquality RNA in the field, across a range of time points and from leaves of different ages, will be a challenge for future studies.
Other challenges that will need to be overcome are associated with sampling at NEON sites. Sampling within NEON permanent plots is generally not allowed for collections outside of NEON's own standard protocol, and therefore our sampling was limited to sites adjacent to NEON plots. This limitation raises some  significant issues for researchers who wish to leverage data being collected within NEON sites (https://data.neons cience.org/). First, many NEON sites are located in areas where there is no similar adjacent field site available for sampling, due to land restrictions or ecological variation. We ultimately selected Harvard Forest because we could sample at sites outside of the NEON plot itself. The second major issue is that sampling outside of the NEON plot means that there is no guarantee of continued access to plant populations in the future. There is a great opportunity for ecologists and evolutionary biologists to leverage the wealth of data that NEON is generating for our community. However, access for researchers that wish to conduct RNA and DNA sampling of plants (and other organisms) within NEON sites is an essential issue that requires further development across the network. Sequencing costs will continue to decline over the planned 30-year life span of NEON, and strategies to accommodate sequencing for plants and other eukaryotes will offer opportunities to greatly expand largescale studies at the intersection of ecology and evolution.

AUTHOR CONTRIBUTIONS
H

DATA AVAILABILITY
Raw reads for all samples for 24 species are deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRP127805: https://www. ncbi.nlm.nih.gov/sra/SRP12 7805; BioProject: PRJNA422719). Assembled transcriptomes for each species are archived on Zenodo and available at https://doi.org/10.5281/zenodo.3727312 . Heat map of Gene Ontology (GO) slim categories present in the entire transcriptome of each species. Each column represents the annotated GO categories from each analyzed transcriptome, whereas the rows represent a particular GO category. The colors of the heat map represent the percentage of the transcriptome represented by a particular GO category, with red being highest and purple lowest. The overall ranking of GO category rows was determined by the ranking of GO annotations in the transcriptome of Lysimachia ciliata. Hierarchical clustering was used to organize the heatmap columns.