Transcriptome assembly and annotation of johnsongrass (Sorghum halepense) rhizomes identify candidate rhizome‐specific genes

Abstract Rhizomes facilitate the wintering and vegetative propagation of many perennial grasses. Sorghum halepense (johnsongrass) is an aggressive perennial grass that relies on a robust rhizome system to persist through winters and reproduce asexually from its rootstock nodes. This study aimed to sequence and assemble expressed transcripts within the johnsongrass rhizome. A de novo transcriptome assembly was generated from a single johnsongrass rhizome meristem tissue sample. A total of 141,176 probable protein‐coding sequences from the assembly were identified and assigned gene ontology terms using Blast2GO. Estimated expression analysis and BLAST results were used to reduce the assembly to 64,447 high‐confidence sequences. The johnsongrass assembly was compared to Sorghum bicolor, a related nonrhizomatous species, along with an assembly of similar rhizome tissue from the perennial grain crop Thinopyrum intermedium. The presence/absence analysis yielded a set of 98 expressed johnsongrass contigs that are likely associated with rhizome development.

systems also exacerbate nutrient runoff. Corn, wheat, and rice have shown nitrogen fertilizer uptake efficiencies ranging from 18 to only 49% (Cassman, Dobermann, & Walters, 2002). In fact, the loss of nitrate N through subsurface drainage may be 30 to 50 times greater in annual than in perennial crops (Randall & Mulla, 2001).
Perennial plants present further ecological advantages via carbon sequestration. Typical annual crops increase soil organic carbon by 0-30 g m 2 in a year, while perennials plants were found to accumulate 32-44 g m 2 (Robertson, Paul, & Harwood, 2000). Perennial crops not only store more carbon, but also have no requirement to expend fossil fuels on tillage. They have a negative global warming potential, in CO 2 equivalents, while annual crops actually increase atmospheric carbon (Robertson et al., 2000).
Sorghum halepense is a perennial grass which has been classified as an invasive weed in 53 countries (Holm, Plucknett, Pancho, & Herberger, 1977). Sorghum halepense rhizomes, which are known to regenerate even when cut into pieces, allow the plant to endure winters and extend locally (Howard, 2004). One plant may grow 275 feet of rhizomes in a single growing season. Moreover, johnsongrass is self-fertile with a high rate of seed production, yielding more than 80,000 seeds in the same growing season (Howard, 2004;Johnson, Kendig, Smeda, & Fishel, 1997). This propensity for reproduction, coupled with rapid growth relative to native grasses, can give johnsongrass a competitive edge and lead to monocultures in unsupervised areas. Johnsongrass also exhibits the ability to grow in a wide range of environmental conditions and has resistances to many common herbicides and pathogens (Howard, 2004;Johnson et al., 1997).
With such vigorous root systems and high fecundity, this species is an excellent model for understanding one important strategy of the perennial growth habit. Sequences that are expressed specifically in Johnsongrass rhizomes and found to be related to resource management or asexual reproduction could improve prediction of perennial behavior in perennial grasses (Jang et al., 2009).
Attempts at breeding a perennial hybrid from diploid S. bicolor and tetraploid S. halepense began in earnest before 2003 at the Land Institute in Salina, Kansas (Cox et al., 2006). Developing a hybrid with sufficient yield may require an extensive time-frame, but this can be expedited through the use of genetically informed techniques. Marker-assisted selection could make use of QTLs discovered in johnsongrass that are linked to rhizome activity (Xiong et al., 2007).

Previous studies have aligned johnsongrass RNA transcripts with
QTLs likely related to rhizome function in sorghum and rice (Jang et al., 2006(Jang et al., , 2009, but lack the sheer quantity of data expected from sequencing technologies today. This study provides a large, novel transcriptome for rhizome tissue from johnsongrass. The sequenced rhizome RNA was assembled, filtered, and annotated with up-to-date tools for nonmodel organisms. An annotated transcriptome for johnsongrass rhizomes then facilitates the discovery of additional rhizome-enriched transcripts. Further analysis with BLAST compared assembled johnsongrass sequences with nonrhizomatous and rhizomatous species and has led to a reduced set of candidate genes for rhizome-related function.

| Plant origin and RNA extraction
Johnsongrass plants were harvested at the coordinates 38°46 0 15.46″ N and 97°34 0 21.77″ W near Salina, Kansas on July 7, 2015. They were shipped overnight on dry ice to Sioux Center, Iowa, where the rhizome apical meristems were removed, immersed in liquid nitrogen, and placed in a À80 o C freezer. RNA was extracted from four samples of rhizome node buds using the RNeasy Plant Mini Kit (Qiagen) precisely following recommended protocols. DNase digestion was then performed with the TURBO DNA-free kit (Thermo Fisher Scientific), followed by the RNA cleanup procedure from the RNeasy mini handbook. Each sample received an A 260 /A 280 ratio greater than 2 in a NanoDrop 2000 Spectrophotometer (Thermo Scientific). All four samples were then sent to the University of Minnesota Genomics Center for sequencing.
Thinopyrum intermedium rhizome tissue was obtained from a clone of a plant derived from The Land Institute's breeding program (C3-3471). RNA was extracted as described above.

| Library construction and next-generation sequencing
The four johnsongrass samples were run through a denaturing

| Trinity assembly and TransDecoder identification of protein-coding sequences
The cleaned reads from each organism were de novo assembled in Trinity (version 2.4.0) (Grabherr et al., 2011;Haas et al., 2013). The johnsongrass assembly was run through TransDecoder (version 3.0.1), a companion to Trinity which uses open reading frames within assembly transcripts to find likely coding regions (Haas & Papanicolaou, 2017).

| Annotation with Blast2GO PRO
The predicted coding sequences from TransDecoder were annotated in Blast2GO PRO, an all-in-one tool that befits large-scale functional annotation with Gene Ontology (GO), especially for an original assembly of a nonmodel organism. The suite performs the BLAST algorithm to align FASTA-formatted sequence inputs with homologs in specified databases. Annotations are mapped to query sequences by their association with BLAST hits (Conesa et al., 2005).
The BLASTX-fast search was used, aligning the coding sequences against the NCBI nonredundant protein database (NR). Query sequences with BLAST hits were mapped to GO terms and assigned functional annotations where available. See Supporting Information Data S2 for Blast2GO results on the TransDecoder predicted coding sequences.

| Estimated transcript abundance with kallisto
To produce approximate abundances of the assembly transcripts, the quantification software kallisto (version 0.44.0) was used via the Trinity toolkit. Kallisto makes an index and performs pseudoalignment to quickly quantify the expression of a set of transcripts, while maintaining similar or greater accuracy than existing methods (Bray, Pimentel, Melsted, & Pachter, 2016). See Supporting Information Data S1 for the top 20 most expressed S. halepense sequences from kallisto with descriptions and annotations.

| Assembly filtering
The johnsongrass assembly was further reduced by filtration of estimated expression levels and top BLAST hit organisms. All lowly expressed transcripts, having a FPKM (fragments per kilobase transcript per million) less than 1, were removed from the assembly as possible background transcription (Davidson et al., 2012). This threshold may have also excluded truly expressed sequences, but was supported by the distribution of estimated abundances (Supporting Information Figure S2). All transcripts with a top BLAST hit that was not a plant were also removed, being potential contaminants. The final, reduced assembly then contained sequences that were very likely to be truly expressed and very unlikely to be from contaminant reads.

| Comparative transcriptomics with S. bicolor and Th. intermedium rhizomes
Two sequenced genomes closely related to S. halepense were selected: S. bicolor (a close annual relative and possible ancestor of S. halepense (Jang et al., 2009)) and Th. intermedium (a recently sequenced rhizomatous perennial grass). The JGI Plant Flagship genome of nonrhizomatous S. bicolor (version 3.1.1) was accessed from the NCBI RefSeq database (Accession ID: ABXC03000000). The protein model reference sequences were formatted into a protein BLAST database. Predicted transcripts for the rhizomatous species Th. intermedium were obtained from a prepublication annotated assembly (JGI annotation version 2.1) (Dorn et al., unpublished-pending release onto Phytozome) and were formatted into a second protein BLAST database for comparison.
Using the S. halepense proposed protein-coding sequences as queries, two BLASTX searches were performed against the custom protein databases of Th. intermedium and S. bicolor. The query sequences that received one or more BLAST hit (e-value < 10 À20 ) were cross-listed between the two searches, and removed. The S. halepense sequences that aligned in rhizomatous Th. intermedium but had no BLAST hits in nonrhizomatous S. bicolor were selected as candidates for rhizome-related function.
A rhizome-specific assembly for Th. intermedium was also generated in this study (NCBISequence Read Archive SRX3529031) and used to further reduce candidates. The assembly was aligned to the  Figure S3 for an illustration of the distribution of contig lengths for the filtered assembly. An N50 value of 1,071 bp indicates that the contigs of this length or longer make up 50% of the total assembly length in base pairs. Table 1 lists the statistics for the sequencing/assembly process as a whole.

| Estimated abundance and assembly filtering
Quantification with kallisto identified 75,153 lowly expressed transcripts (FPKM < 1), which were removed from the assembly. After

| Annotation
A BLASTX-fast search against the NCBI nonredundant protein database (NR) aligned 57,985 (90%) of the johnsongrass filtered assembly sequences with at least one hit (e-value < 10 À20 ). GO terms associated with BLAST hits were then mapped to 49,368 (76%) of the query sequences. Lastly, functional annotations were found for 46,792 (73%) of the sequences (Table 2).
Sorghum bicolor was the most prominently aligned species by far, receiving 51,544 of the top BLASTX hits (89%). The next highest amount of hits for a plant species was 3,953 (7%), from Zea mays (maize). A collection of fungi, bacteria, and animals accounted for 1,576 (2%) of the top alignments, but many of these may be from symbiotes or pathogens dwelling within the roots. See Supporting Information Figure S4 for the distribution of BLASTX top hits among the twenty most-hit plant species.
A total of 322,100 gene ontology terms were mapped to the filtered johnsongrass contigs. These GO terms were split into three categories by their associations: 104,705 were for biological processes, 125,645 were for molecular functions, and 91,750 were for cellular components. The most-mapped term was cellular "membrane" and was associated with 17,252 sequences. The results of gene ontology mapping and most common terms are listed in Table 2. Supporting Information Figure S5 depicts the distribution of GO level 1 terms over the three categories.

| Comparative transcriptomics
Of the 141,176 johnsongrass coding sequences, only 991 of the johnsongrass contigs had a BLASTX hit from rhizomatous Th. intermedium and no hit from annual S. bicolor. Of these selected contigs, 259 had the same BLASTX hit from Th. intermedium as a sequence from the Th. intermedium rhizome assembly. After removing sequences which may be lowly expressed or from nonplant organisms, 98 transcripts remained, which corresponded with 58 unique, putative peptide sequences from the Th. intermedium reference genome (Supporting Information Data S4 and S6, respectively). Figure 1 depicts the BLASTX comparisons of the johnsongrass contigs.

| DISCUSSION/CONCLUSION
This study was meant to be an exploration of the rhizome transcriptome in a nonmodel organism. The lack of sample replication and of T A B L E 1 Quantitative overview of sequenced reads before and after cleaning, assembled contigs, and predicted coding sequences from TransDecoder Using presence/absence analysis via BLASTX searches against a

Sequencing and assembly statistics
Th. intermedium rhizome assembly and other genomes, 98 sufficiently expressed S. halepense sequences were selected as potentially related to rhizome behavior. This comparison method is dependent on BLASTX search parameters and a stringent e-value < 10 À20 , so some sequences without or with lower quality matches may also have been relevant.
In  (Sharma et al., 2014). That these two sequences are highly expressed in harvested rhizomes is not surprising.
To gain more insight into the annotations of the rhizome-specific terms were found to be overexpressed and 6 to be under-expressed.
The overexpressed terms included karyogamy and a variety of DNA processes and activities, which may be linked to cell division and coincide with rhizome behavior. The rest of the terms were for processes and activities of vitamin B6, which may be linked to development and nitrogen metabolism (Colinas et al., 2016). perenniality. This study provides an important step in the process, as well as providing a high-confidence set of contigs (64,447 predicted contigs) for the, as yet, unsequenced S. halepense.
The 98 candidate sequences will now be the focus of further bioinformatics analysis (e.g., protein functional prediction, enhanced annotation, and BLASTing). In particular, promising candidates may become the target of genetic modification (e.g., knockout; overexpression) experiments to elucidate genes critically part of grain crop perenniality.
Quantitative trait loci associated with rhizome function may be used to improve breeding practices for perennial traits in a Sorghum hybrid. This study has produced a comprehensive dataset which facilitates further research with S halepense and other perennial grains.

DATA ACCESSION
The johnsongrass dataset has been registered with NCBI as a BioProject (PRJNA417857). The raw read files are available in the Sequence Read Archive (SRP125786), and this Transcriptome Shotgun Assembly project has been deposited at DDBJ/EMBL/GenBank under the accession GGDZ00000000. The version described in this article is the first version, GGDZ01000000.
The Th. intermedium dataset has been registered with NCBI as a BioProject (PRJNA428355). The raw read files are available in the Sequence Read Archive (SRP127982), and this Transcriptome Shotgun Assembly project has been deposited at DDBJ/EMBL/GenBank under the accession GGEN00000000. The version described in this article is the first version, GGEN01000000.

AUTHOR CONTRIBU TI ONS
NLT, KMD, and JP conceived and supervised this study. NR and MH extracted S. halepense RNA from the rhizomes harvested by LDH and processed by JP. KMD extracted from the Th. intermedium rhizomes and assembled both rhizome assemblies after sequencing. SL, MA, NLT, and NR performed the annotation and analysis. NR wrote the manuscript with contributions from all other authors.