A chromosomal assembly of the soybean cyst nematode genome

The soybean cyst nematode (Heterodera glycines) is a sedentary plant parasite that exceeds billion USD annually in yield losses. This problem is exacerbated by H. glycines populations overcoming the limited sources of natural resistance in soybean and by the lack of effective and safe alternative treatments. Although there are genetic determinants that render soybeans resistant to nematode genotypes, resistant soybeans are increasingly ineffective because their multiyear usage has selected for virulent H. glycines populations. Successful H. glycines infection relies on the comprehensive re‐engineering of soybean root cells into a syncytium, as well as the long‐term suppression of host defences to ensure syncytial viability. At the forefront of these complex molecular interactions are effectors, the proteins secreted by H. glycines into host root tissues. The mechanisms that control genomic effector acquisition, diversification, and selection are important insights needed for the development of essential novel control strategies. As a foundation to obtain this understanding, we created a nine‐scaffold, 158 Mb pseudomolecule assembly of the H. glycines genome using PacBio, Chicago, and Hi‐C sequencing. A Mikado consensus gene prediction produced an annotation of 22,465 genes using short‐ and long‐read expression data. To evaluate assembly and annotation quality, we cross‐examined synteny among H. glycines assemblies, and compared BUSCO across related species. To describe the predicted proteins involved in H. glycines’ secretory pathway, we contrasted expression between preparasitic and parasitic stages with functional gene information. Here, we present the results from our assembly and annotation of the H. glycines genome and contribute this resource to the scientific community.

(Heterodera glycines) is of particularly great economic importance due to its prominent role in reducing soybean yields worldwide (Davis & Tylka, 2000;Jones et al., 2013;Koenning & Wrather, 2010;Nicol et al., 2011). Overcoming this crop pest requires scrutinizing the H. glycines lifestyle and the molecular exchange at the core of this problem.
The lifecycle of H. glycines begins as an egg that is queued to hatch. The emerged juvenile nematode migrates to the host root zone, where it penetrates the outer layers of roots using a combination of mechanical and enzymatic processes, and eventually induces a single root cell near the vascular cylinder to form a feeding site, known as a syncytium (Pogorelko et al., 2016). The syncytium becomes metabolically active and expands to incorporate hundreds of adjacent cells through cell wall breakdown and protoplast fusion. The syncytium matures to an efficient nutrient sink with enlarged host nuclei and pronounced cytoplasmic streaming (Abad & Williamson, 2010;Lilley et al., 2005).
Successful feeding site development depends upon the parasite's ability to manipulate a complex interaction with its host via the transfer of nematode gland cell-produced effector proteins into or around host root cells Mitchum et al., 2013;van den Akker et al., 2014). During juvenile nematode migration within the root, plant cell walls are digested by an abundance of secreted enzymes including cellulases, pectate lyases and other hydrolases (De Boer et al., 2002;Gao et al., 2003;Rai et al., 2015). In later parasitic stages, the nematode manipulates plant metabolism (Bekal et al., 2003), development (Matthews et al., 2014;Mitchum et al., 2008;Wang et al., 2010Wang et al., , 2011, and elicits a dramatic and long-term suppression of host defences (reviewed by Hewezi, 2015;Mitchum et al., 2013;Noon et al., 2015). While the functional mechanisms of many effector proteins remain elusive, a variety of functions have been attributed to previously characterized effector proteins secreted from the esophageal gland cells of root-knot and cyst nematodes (Chronis et al., 2013;Elling & Jones, 2014;Eves-van den Akker et al., 2014;Gao et al., 2001Gao et al., , 2003Haegeman et al., 2012;Hamamouch et al., 2012;Hewezi, 2015;Hewezi et al., 2010;Lee et al., 2011;Maier et al., 2013;Noon et al., 2015;Patel et al., 2010;Vanholme et al., 2006;Wang et al., 2001Wang et al., , 2005. For example, a chorismate mutase protein, typically absent in animals, is secreted by root-knot and cyst nematodes to manipulate the plant host's shikimate pathway, a pathway involved in producing aromatic amino acids, plant hormones, cell wall components, and plant defence metabolites (Bekal et al., 2003;Lu et al., 2008;Noon & Baum, 2016;Vanholme et al., 2009). Signalling peptides, like CLAVATA3 plant peptide mimics, can affect plant developmental pathways (Lu et al., 2009;Olsen & Skriver, 2003;Replogle et al., 2011;Wang et al., 2005Wang et al., , 2010Wang et al., , 2011. While these effectors have led to a better understanding of plant-nematode interactions, only a small portion have been functionally characterized. Understanding the totality of effector proteins in the nematode genome and how they manipulate the host will shed light on this molecular interplay, inspiring the development of novel mechanisms to defend plants from these important pests. To accomplish this goal for the soybean cyst nematode, two annotated genome assemblies were published from two different nematode strains: a partially virulent TN10 line (Masonbrink et al., 2019) and a highly virulent X12 line (Lian et al., 2019). Here, we improve upon the existing published H. glycines genomes by reassembling the TN10 PacBio reads de novo and scaffolding with Chicago and Hi-C reads to obtain the highest quality plant-parasitic nematode genome assembly to date with nine complete pseudomolecule chromosomes and zero unscaffolded contigs. We went to great lengths to ensure the integrity of the assembly, by mapping input genome assembly reads back to the assembly and by using synteny in among H. glycines assemblies. While large rearrangements exist between the TN10 and X12 pseudomolecule assemblies, technological improvements in Hi-C scaffolding software (Lachesis vs. Juicer) revealed that these differences can be attributed to many small and a few large chromosomal misjoins in the X12 assembly. Although the X12 and this latest TN10 assembly have similar assembly metrics and size, 141 versus 158 Mb, choices in gene prediction created a large disparity in gene frequency between annotations. Here, we have attempted to bridge this gap with an extensive gene annotation that uses multiple prediction pipelines and lines of evidence to generate an annotation that is complete and comparable to other parasitic nematode species. We also limited our gene homology input to include only genes of related Tylenchida species to prevent the homology-driven over-simplification of gene structure when using more distant and nonparasitic relatives. Using this vastly improved genomic resource, we explore the nature of previously published effectors and other secreted proteins to address the heart of H. glycines genomics, to understand the adaptive evolution involved in the constant battle between host resistance and parasite virulence.

| Dovetail Chicago library preparation and sequencing
A Chicago library was prepared as described previously (Putnam et al., 2016). Briefly, ~500 ng of high molecular weight (HMW) gDNA (mean fragment length = 75 kbp) was reconstituted into chromatin in vitro and fixed with formaldehyde. Fixed chromatin was digested with DpnII, the 5' overhangs filled in with biotinylated nucleotides, and then free blunt ends were ligated. After ligation, crosslinks were reversed, and the DNA purified from protein. Purified DNA was treated to remove biotin that was not internal to ligated fragments.
The DNA was then sheared to ~350 bp mean fragment size, and sequencing libraries were generated using NEBNext ultra enzymes and Illumina-compatible adapters. Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library. The libraries were sequenced on an Illumina HiSeq X to produce 500 million 151 base paired-end reads.

| Dovetail Hi-C library preparation and sequencing
A Dovetail Hi-C library was prepared in a similar manner as described previously (Lieberman-Aiden et al., 2009). Chromatin was fixed in the nucleus with formaldehyde, extracted, and DpnII digested. 5' overhangs were filled with biotinylated nucleotides, and then free blunt ends were ligated. Crosslinks were reversed for DNA purification from proteins. Purified DNA was treated to remove biotin outside of ligated fragments. The DNA was then sheared to ~350 bp, and libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters. Biotinylated fragments were isolated with streptavidin beads before PCR enrichment of libraries and sequenced on an Illumina HiSeq X to produce 531 million 2 × 151 bp paired end reads.

| Genome assembly
To create a more reproducible H. glycines genome assembly, we used Falcon with previously deposited Pacbio sequencing (SRX2692203 -SRX2692222), rather than use the previous assembly (Masonbrink et al., 2019). Falcon unzip 0.4.0 (Chin et al., 2016) was then used to reduce the heterozygosity in the assembly. Dovetail Genomics scaffolded this assembly with Chicago and Hi-C reads using a modified SNAP read mapper (http://snap.cs.berke ley.edu) and an iterative assembly with HiRise. This Dovetail assembly was further scaffolded with the previously mentioned Pacbio subreads using sspace-longread v1.1 (Boetzer & Pirovano, 2014). gmcloser 1.6.2 (Kosugi et al., 2015) was used to fill gaps using PacBio circular consensus (CCS) reads. pilon 1.22 (Walker et al., 2014) was then used to polish the assembly using ~142 million 260 bp PE Illumina reads, which were processed with trim galore 0.4.5 (Krueger, 2012), hisat2 2.1.0 (Kim et al., 2015), and samtools 1.9 (Li et al., 2009). The genome was then scaffolded using the previously mentioned Hi-C reads using juicer 1.7.6 (Lieberman-Aiden et al., 2009), 3D-DNA 180419 (Dudchenko et al., 2017), and manually corrected using juicebox 1.9.8 (Durand et al., 2016). Once pseudomolecules were assembled, the genome was polished with pilon 1.23 (Walker et al., 2014) using CCS reads, then with the 142 million Illumina PE reads, and then with 38 iterations of polishing using Pacbio CCS reads with pilon 1.23. PacBio polishing iterations were discontinued when changes novel changes to the genome ceased, a measure taken to adjust for the large population of individuals sequenced and potential small contig orientation problems in Juicebox. A final round of polishing was performed with the 142 million Illumina 250 bp reads (SRR8381095) using pilon 1.23 (Walker et al., 2014).
To obtain the final gene counts, mRNAs were classified via repetitiveness and expression. mRNAs with at least a 30% overlap in two CDS sequences with a repeatmasker repeat were removed from the final gene prediction (8313). Another 643 transcripts were filtered from the annotation that had a functional containing the term, helitron, transposase, or transposon. These genes were placed in a separate annotation track, representing transposable elements. Another 11,268 predicted mRNAs with one or fewer unique mapping RNAseq reads were also were removed from the final annotation and placed in an annotation track, representing unexpressed genes.

| Functional gene annotations
Gene annotations were compiled from interproscan 5.27-66.0 (Finn et al., 2016) and blast (Madden, 2013) searches to NCBI NT and nr databases downloaded on 10-23-19 (Coordinators, 2016), as well as swissprot/uniprot databases downloaded on 12-09-2019 (Consortium, 2019). Genes encoding transposable element-associated proteins were identified using bedtools 2.27.1 (Quinlan, 2014) with exon overlaps to EDTA and Repeatmodeler-predicted transposable elements. TA B L E 1 Published effector locations and putative duplications between TN10 and X12 genome assemblies. These are nucleotide alignment with GMAP of 80 previously published effector genes. Secreted status of predicted effector proteins indicates the presence of a signal peptide and the absence of a transmembrane domain past the signal peptide cleavage site

| BUSCO analysis
Universal single copy orthologous genes were assessed using busco 5.12 (Seppey et al., 2019;Simão et al., 2015;Waterhouse et al., 2017) on both the predicted proteins and the genome against the nematoda ODB10 data set. In genome mode, --augustus and -long options were used to survey genomes with the nematodea_odb10 data set. Revised missing counts were determined by taking total missing counts from the consensus of both genome and protein BUSCO scans, divided by the total number of BUSCOs available (3131-666 = 2465).

| Repeat prediction
Multiple repeat predictions were pursued to finely detail genome structure. To comprehensively predict the structure of transposable

TA B L E 1 (Continued)
elements in the genome with Extensive de novo TE Annotator, edta 1.7 (Ou et al., 2019). tandem repeat finder 4.0.9 (Benson, 1999) was run on the genome to identify tandem repeats. A repeat prediction sensitive to copy number variation was also pursued with repeatmodeler 1.0.11 (Smit et al., 2014) and repeatmasker 4.0.9 (Smit et al., 2015).
By deriving gene orthology from primary mapping sites of the pre-

| Genome quality metrics
The H. glycines TN10 genome assembly comprises 2109 contigs, all of which were incorporated into the expected nine pseudomolecule scaffolds using the Juicer pipeline, in agreement with cytological observations (Cotten, 1965;Lian et al., 2019) (Figure 1). Our goal was to generate a genome using easy-to-replicate de novo methods to facilitate comparisons with future H. glycines assemblies, as the previous draft assembly utilized synteny among contigs to generate scaffold-size contigs (Masonbrink et al., 2019). With this approach we used Falcon to assemble and Falcon Unzip (Chin et al., 2016) to remove putative haplotigs from the assembly. Dovetail genomics scaffolded the genome with Chicago and HiC reads, reducing the assembly to 509 scaffolds (Table S1). This assembly was scaffolded further using Sspace (Boetzer & Pirovano, 2014) with PacBio subreads, gaps were filled with gmcloser (Kosugi et al., 2015), and pilon  (Table S1).
The genome size of the new assembly was 157,982,452 bp, within the expected range for this clade of species (Table 2); however, the genome size was 34 Mb larger than the previous TN10 assembly (  Figure 1a); however, we did not find a single repeat, but six different repeats enriched in these regions, including mutator like elements, hAT-like elements, a gypsy retroelement, and a putative MITE ( Figure S1). Another interesting phenomenon was the observation that the high density HiC signals were associated with regions of high gene density ( Figure 1B)

| Improvements over existing soybean cyst nematode assemblies
This pseudomolecule assembly is a massive step forward in the genomics of plant-parasitic nematodes, increasing the ability of nine pseudomolecules and 19 unscaffolded contigs (Lian et al., 2019). We further investigated synteny by performing whole genome alignments with minimap2 and dotplotly, thereby verifying the scaffolding successes of the new TN10 assembly (Figure 2a) and clarifying the differences in chromosomal structure between the TN10 and X12 pseudomolecule assemblies (Figure 2c,d).
Considering that more than 61 Mb of repeats are in the TN10 genome (38.6%), synteny to 41.5% and 51% the genome in the TN10 draft and X12 assemblies, respectively, is high.
Assignment of contigs to chromosomes was improved in this assembly compared to existing X12 assembly. These differences  Table S3). Surprisingly, after adjusting for these large chromosomal misjoins in X12, there were very few chromosomal rearrangements between the two lines of a highly adaptable species (Figure 2c,d).

| Gene annotation
The gene annotation resulted in 22,465 gene models, encoding 23,933 transcripts with an average gene length of 4569 bp, values that are comparable to related species (Table 3). While the frequency of genes is substantially larger than the previously published X12 annotation (11,882), the propensity for parasites to duplicate genes involved in host-parasite interactions requires a novel approach to gene prediction. To prevent the obliteration of parasitism genes thought to be maintained at high copies in the nematode, we developed an annotation approach to predict all transcribed elements in the genome, including repetitive elements. A genome without repeat masking was used to allow highly similar, high-copy number genes to be identified.
However, because repetitive elements frequently reside in noncoding regions of genes, multiple genome-guided transcriptomes and gene predictions enabled the dissection of high-confidence gene models (Table S4). This improvement in gene prediction is indicated by our total gene count (22,265), a figure similar to the previous TN10 draft when considering that 9000 repetitive elements were among the 29,769 genes examined in the previous assembly (Masonbrink et al., 2019). Our analyses included known parasitism genes (see Methods) and repeats missing from X12 (11,882) and produced a more highly F I G U R E 1 Hi-C and coverage plot of Heterodera glycines TN10 genome assembly. (a) Hi-C plot of the nine pseudomolecules in the TN10 genome. (b) Coverage plot of the TN10 genome assembly, Gene-based synteny between TN10 and X12 H. glycines genomes. From outside to inside: Yellow is gene density, purple is exon density, blue is Tandem Repeat Finder repeats, red is Repeatmasker and repeats, and green is EDTA annotated repeats TA B L E 2 Genome assembly and annotation statistics for H. glycines and related species. BUSCO was run on both Eukaryota_odb10 and Nematoda_odb10, using both genome and predicted proteins. Max missing is score computed by removing the 666 BUSCO genes from Nematoda_odb10, that were not found in any of the 8 genomes assessed   Abad et al. (2008) contiguous genome with fewer fragmented genes than the previous TN10 assembly (29,769). Our average gene and transcript lengths are the largest among the compared species, while exon count per transcript has also increased relative to earlier annotations of H. glycines and other related species (Table 3). Another line of evidence to support these gene predictions lies with the high proportion of genes that have functional annotations with 85.2% of predicted proteins or transcripts having homology to sequences in Interpro, Swissprot, NCBI NR, or NCBI NT databases (Tables S5 and S6).

| Effector gene prediction
With a high-quality pseudomolecule genome, we can now better understand the molecules that are secreted from the parasite into the host. Though signal peptides provide a reliable method to identify proteins entering the secretory pathway, a number of proteins with nonclassical secretion will be overlooked (Bendtsen et al., 2004;Gahoi & Gautam, 2017;Nielsen et al., 2019). However, to illustrate what can be predicted, we identified transcripts that produce transmembrane-lacking proteins with signal peptides. We identified 1514/23,933 transcripts that produce a secreted protein, originating from 1421 of the 22,465 genes (Table 4). A second step to identifying these molecules lies with attributing the previously published effectors to genes in the genome (Gao et al., 2003;Noon et al., 2015). Using DNA sequence similarity to previously published effectors, 100 potential effector genes were identified in the TN10 genome with a minimum query alignment and sequence identity of 50% (Table 1). We utilized the same parameters to discover effectors in the X12 genome, identifying 98 effector genes and the absence of 45D07 and 4D09 effectors (Table 1). Shifts in the frequency of effector duplication were evident among the two lines, with 12 effector duplications in TN10 and 14 duplications in X12 (Table 1). To expand this putative effector data set, we identified effectors in the predicted proteome of TN10. However, only 43/100 and 25/98 of these putative effectors encode secreted proteins (Table 1), indicating that genes may be variable among SCN lines in their propensity to produce secreted proteins, a variability that may contribute to SCN virulence.

| Expression of effector genes
In the hope of further resolving the genes important to the host-parasite exchange, we utilized existing H. glycines RNA-seq (SRP122521). All possible comparisons were made between preparasitic (i.e., before root penetration) second-stage juvenile nematodes (PP), second-stage parasitic (i.e., after root penetration) nematodes on a susceptible host (C for compatible), and secondstage parasitic nematodes on a resistant host (IC for incompatible) F I G U R E 2 Synteny plots of the TN10 pseudomolecule genome. (a) Nucleotide alignment of TN10 pseudomolecule genome to the TN10 draft. (b) Circos plot of synteny found through iAdHoRe TN10 pseudomolecule (blue) to TN10 draft genome (green). (c) Nucleotide alignment of TN10 pseudomolecule genome to the X12 genome. (d) Circos plot of iAdHoRe synteny between TN10 pseudomolecule and X12 genome (Data S1). These data were integrated into a tabular database of genes, functional annotations, sequences, and differential expression. Using this database, we filtered genes in the H. glycines genome for key traits of genes involved in parasitism, including differential expression at a p-value of 0.05, >1 log2 fold change, signal peptide presence, and the absence of a transmembrane domain. Of the 1421 genes assessed, 61 genes were differentially expressed between the PP and C, 392 genes between PP and IC, and 609 in novel comparisons between C and IC samples (Table 4). Among these comparisons, we assessed which genes may be involved in host nucleus reprogramming by annotating NLS signals in these differentially expressed genes. We only found four and 14 genes encoding proteins with NLS signals in the C and IC versus PP comparisons, respectively. However, comparisons between C and IC revealed 113 differentially expressed genes that were also secreted and nuclear targeted.
While effector genes upregulated in the preparasitic stages are likely to be associated with the migratory phase of parasitism, obtaining a list of candidate effectors for parasitic stages was less complete. Only four published effectors were upregulated in C versus PP, and three were upregulated in IC versus PP (Table 3). However, by comparing effector genes that were downregulated in parasitic IC samples but also upregulated in C samples versus PP, we discovered a number of effector genes that were downregulated when encountering resistance: 6E07 (Gao et al., 2003), 4G06 (ubiquitin extension) (Gao et al., 2003), 4D06 (Gao et al., 2003), three 45D07 type effectors (chorismate mutase) (Bekal et al., 2003), 30C02 (defence suppressor) (Hamamouch et al., 2012), four 2D01 type effectors (interacts with plant LRR) (Lin, 2016), 20E03 (Gao et al., 2003), 12H04 (Gao et al., 2003), 5A08 (RAN-binding, interacts with soybean LRRs (Quentin et al., 2013)), and Gland14 (endopeptidase) (Noon et al., 2015). While interesting, these decreases in expression could also be interpreted as the early stages of nematode death on a resistant host (IC), warranting further investigation into the mechanisms for these expression changes.

| Conclusion
In summary, we present the most complete H. glycines assembly, with a consensus gene prediction pipeline sensitive to the prediction of high-copy parasitism-related genes. We confirm this with a high percentage of synteny to previous assemblies, high read mapping rates, and the complete integration of all contigs into nine pseudomolecules. Using currently available data, we compiled a comprehensive resource that extensively annotates H. glycines genes, a critical resource for the development of advanced technology to combat this pest. This resource will be integrated into SCNBase.org, which fur-

DATA AVA I L A B I L I T Y S TAT E M E N T
All scripts and notes used to prepare this genome are available at https://github.com/ISUge nomic s/Dovet ail2S CNGenome. The genome, annotation, and Hi-C reads were uploaded to Genbank and SRA under the Bioproject PRJNA603076 and SRR8381095. All genome track data and annotations will also be hosted on SCNBase.
org. Once released, genomes and annotations will be uploaded to ParaSite.WormBase.org.

O RCI D
Rick E. Masonbrink https://orcid.org/0000-0002-6848-2144 TA B L E 4 Differentially expressed transcripts and genes with consecutive filtration by significant differential expression, greater than one log2fold change, signal peptide presence, and the lack of transmembrane domain. These filtered genes are further defined by the presence of a nuclear localization signal and if a gene was associated with a previously identified effector. C denotes a parastiic upregulation on a susceptible host, PP denotes upregulation in preparasitic stages, and IC denotes parasitic upregulation on a resistant host