Novel genome characteristics contribute to the invasiveness of Phragmites australis (common reed)

Abstract The rapid invasion of the non‐native Phragmites australis (Poaceae, subfamily Arundinoideae) is a major threat to native wetland ecosystems in North America and elsewhere. We describe the first reference genome for P. australis and compare invasive (ssp. australis) and native (ssp. americanus) genotypes collected from replicated populations across the Laurentian Great Lakes to deduce genomic bases driving its invasive success. Here, we report novel genomic features including a Phragmites lineage‐specific whole genome duplication, followed by gene loss and preferential retention of genes associated with transcription factors and regulatory functions in the remaining duplicates. Comparative transcriptomic analyses revealed that genes associated with biotic stress and defence responses were expressed at a higher basal level in invasive genotypes, but native genotypes showed a stronger induction of defence responses when challenged by a fungal endophyte. The reference genome and transcriptomes, combined with previous ecological and environmental data, add to our understanding of mechanisms leading to invasiveness and support the development of novel, genomics‐assisted management approaches for invasive Phragmites.

by shaking, centrifuged, and the supernatant discarded. This process was repeated a second time and 500mL DI water was added to the rinsed mycelium and blended for 30 seconds in a sterile blender, and transferred to a 550mL plastic spray bottle for several hours before application.
Abundant mycelium was visible on a microscope slide at 200X magnification following a single spray, but few or no fungal spores were present. Control plants received identical treatments but with sterile needles, sterile toothpicks, and a 2% liquid culture media solution in sterile DI water.
Treated plants were then covered and secured in a plastic bag overnight to maintain high humidity. The growth chamber was kept dark for 36 hours after inoculation before resuming a 14:10 hour light/dark cycle.

Plant tissue sampling and inoculum isolation. Two inoculated and two control plants
per genotype were sampled eight days after inoculation treatment applications for transcriptome analysis. Soil was washed away in tap water and the plant was blotted dry before the treated tillers and attached rhizome were processed. We sampled the first 2cm of rhizome directly connected to the infected tiller. If more rhizome tissue was present, 0.5cm rhizome sections were taken every 2cm or until a total of 5cm was sampled (2cm + six 0.5cm samples). The rhizome sections were cut in half lengthwise with a clean razor blade to replicate samples for RNA extraction. Three leaves from the infected tiller were each cut into quarters after a sample was taken to confirm inoculation. The top and bottom quarter of each leaf were pooled to replicate samples for RNA extraction.
Leaf tissue samples were taken from the three inoculated leaves per plant rubbed with sand to confirm the effectiveness of inoculation. A 1cm 2 leaf section from each leaf was cut into small fragments, placed in small plastic tissue cassettes and rinsed for 20 minutes under running tap water individually. Leaf fragments within the tissue cassettes were then washed for 1 minute in 95% ethanol, 3 minutes in 0.18% sodium hypochlorite, 1 minute in 95% ethanol, and 2 minutes in sterile DI water following Christian et al. (2019). Each leaf fragment was removed from the tissue cassette, air-dried on a sterile paper towel and placed on the surface of a corn meal agar (CMA) petri plate supplemented with 25mg tetracycline/L for 5 minutes to confirm the efficacy of surface sterilization. Each fragment was then cut in half with a sterile razor to expose the plant interior and plated on another CMA + tetracycline plate to determine if subsequent fungal growth corresponds to the original Alternaria alternata inoculum.

RNA Extraction.
Tissue samples were grounded in liquid nitrogen in porcelain mortars and pestles while the mortar was resting in dry ice. Mortars and pestles were previously wrapped in aluminum foil and baked in a drying oven at 180°C for six hours to destroy RNase and then pre-chilled in a -20°C freezer. Grounded tissue and liquid nitrogen was decanted into a 2mL microcentrifuge tube and liquid nitrogen was allowed to evaporate. ~1mL of RLT buffer with 10μl 2-mercaptoethanol per mL was then added. Each sample was promptly vortexed and put on ice until transfer to long-term storage at -80°C. Between samples (except replicates), mortars and pestles were washed with soap and water, sprayed with RNase AWAY (Molecular BioProducts, Inc.), and re-chilled. Most samples were processed within ten minutes of cutting.
RNA was extracted using the RNeasy Plant Mini Kit (Qiagen) following the manufacturer's protocol with the following minor modifications. Samples were thawed on ice, and centrifuged for 15-30 seconds at 10,000xg to pellet plant tissue and allow for pipetting of the supernatant. Then, 300 µl of sample was added to the Qiashredder column (up to ~800ul for samples with lower concentration). Extracts were stored at -80°C in RNase-free tubes.

Reference genome assembly
For the reference genome sequencing and assembly, we chose the PacBio sequencing platform to benefit from its long read lengths. We selected Canu as the primary genome assembler so that the error-prone PacBio long reads can be corrected based on the consensus among themselves (Koren et al., 2017). To ensure efficient correction in the hierarchical assembly by Canu without the need for a secondary sequencing technology, we produced PacBio reads that represent a more than 40-fold sequencing depth for a haploid genome size 1Gb, estimated based on the 1C value of 1.01pg reported for Phragmites australis (Pyšek et al., 2018).
The completeness of the resulting primary assembly was confirmed by the high complete BUSCO content (93.3%; Fig. 3  We report all assembled contigs as the reference genome sequences without an additional purging step. While there has been a concern recently raised on incomplete haplotig and duplicate purging for genome assemblies based on long reads (Guiglielmoni et al., 2021;Murigneux et al., 2020), currently available tools for haplotig/duplicate purging have not been designed nor tested for highly repetitive plant genomes with either polyploidy or recent whole-genome duplication and so could result in over-purging true duplicated genomic regions (Guan et al., 2020;Roach et al., 2018).
Determination of non-reference allele frequencies in single nucleotide polymorphism (SNP) loci.
Before gene model prediction, we aligned paired-end short reads (150 nucleotides x2) derived from the same genomic DNA samples used for the reference genome assembly, to the reference genome sequence using bowtie2 (v. 2.2.6) (Langmead & Salzberg, 2012). We then determined non-reference allele frequencies in genomic regions with homologies to Benchmarking Universal Single-Copy Ortholog (BUSCO) sequences (Chan et al., 2017) using mpileup command of samtools (v. 1.8) with a Phred quality score cutoff of 20 (Li, 2011) ( Supplementary Fig. 1). To determine the genomic regions harboring putative BUSCOs, we first detected transcripts homologous to the 956 BUSCO HMM profiles (Chan et al., 2017) from the "Invasive combined" transcriptome (Supplementary Table 2). These transcripts were mapped to the genomic regions using GMAP (v. 2012-04-21) (Wu & Watanabe, 2005).
Identification of "Conserved" and "Duplicate" groups among P. australis reference gene models.
Using OrthoFinder (v. 2.2.7) (Emms & Kelly, 2019) and MMseqs2 (Steinegger & Söding, 2017), we identified ortholog groups (OGs) among deduced protein sequences from the primary reference gene models of P. australis and five comparator monocot species. For each OG, we calculated the ratio of ortholog copy number (RCN) for each species, as defined as the ortholog copy number in a species divided by the mean ortholog copy number of all other species (Fig.   4a). OGs with the standard deviation of ortholog copy number across all species exceeding 10% of the total number of orthologs were excluded since such OGs tend to include lineage-specific expansion of transposable element-like genes whose copy numbers are less relevant with the P.
Among P. australis OGs, we identified "Conserved" and "Duplicated" groups whose RCN values are between 0.75 to 1.25 ("Conserved") and larger than 1.5 ("Duplicated"), respectively ( Fig. 4a and Supplementary Table 1), representing P. australis orthologs whose copy numbers have been fractionated to the level similar to other monocot species ("Conserved") or retained as duplicated after the WGD event. a Selected for comparative analyses in Fig. 3 and Fig. 4.

Supplementary
b Based on "embryophyta odb10" with total 1,375 Benchmarking Universal Single-Copy Orthologs (BUSCOs), shown are percentages of complete (C) single-copy (S), duplicated (D), fragmented (F), and missing (M) BUSCOs. Stats for open reading frames (ORFs) for all de novo assembled transcriptomes. a RNA-seq reads (150nt x2) from leaf, shoot, and rhizomes of the same clone used for genome assembly were assembled with Trinity (v. 2.4.0) and SPAdes (v. 3.10.0) and consolidated using the evigene pipeline. This transcriptome was used as the "EST hints" for reference gene model annotation. b Half of all RNA-seq reads (37nt x2) from the three native genotypes were combined and assembled as described in Methods. The Native rhizome transcriptome was the representative of P. australis native subspecies (ssp. americanus) in Fig. 3c and 3d. Using the Native leaf transcriptome produced similar results. b All RNA-seq reads (37nt x2) from each tissue were assembled to transcriptomes for each genotype, as described in Methods. The rhizome transcriptomes represented the six genotypes in Fig 5b. The leaf transcriptomes produced a near identical tree.   Supplementary Fig. 3 | The fifty largest synteny blocks detected within the P. australis genome. P. australis genome contigs harboring the fifty longest synteny blocks were compared to themselves using SynMap (https://genomevolution.org/coge/SynMap.pl) and visualized as a dot plot with each syntelog pair shown a dot color-coded based on the synonymous substitution rates (Ks) between the pair. The histogram (upper panel) shows both the color keys and distribution for Ks. Example synteny blocks representing P. australis-specific duplication and the ρ duplication shared among the grass lineage are marked with arrows. Supplementary Fig. 4 | RNA-seq analysis comparing basal-level expression among native and invasive P. australis genotypes. Representative "MA"-plots (Love et al., 2014) showing distribution of log2 fold differences (y-axis) and mean normalized RNA-seq counts (x-axis) for NOH1 and IOH1 genotypes. RNA-seq reads from both NOH1 and IOH1 genotypes were aligned to the reference genome sequence and basal-level expression between genotypes were compared after normalization using total number of aligned reads, as detailed in Methods. Reference genes with significantly different basal-level expressions (adjusted p-value<0.05, estimated by DESeq2 (Love et al., 2014)) are in red. Note that despite the relatively lower percentage of total reads aligned to the reference genome sequences among native genotypes compared to invasive genotypes (Supplementary Table 2), the MA-plots did not show notable shifts towards lower basal-level expression in the native genotype. All other native and invasive genotype pairs showed similar patterns, corroborating the similar numbers of significantly differentially expressed genes (DEGs) between native and invasive genotype pairs in Fig. 5d.  Supplementary Fig. 5 | An example network of Gene Ontology (GO) terms enriched among P. australis reference genes with higher basal-level expression in an invasive genotype. The network generated using BiNGO (Maere et al., 2005) where non-white colors indicate sighnificant adjusted p-value of enrichment with among P. australis reference genes with higher basal-level expression in the leaf samples of the IOH1 than the NOH1 genotype. Each GO term was indicated as a circle node, with its diameter mirroring the size of the GO term, and connected 10 -2 <10 -7

Supplementary
Adjusted P to its parent and child GO terms. Note that among the child GO terms of "response to stimulus," defense-related GO terms, e.g. "defense response" (red box), showed the highest levels of enrichment, while other child terms such as "response to abiotic stimulus" was not significantly enriched. For number of genes in each GO terms and detailed p-values for genes showing different basal-level expression between all genotype pairs, see Supplementary Dataset 4.