Evolution of five environmentally responsive gene families in a pine‐feeding sawfly, Neodiprion lecontei (Hymenoptera: Diprionidae)

Abstract A central goal in evolutionary biology is to determine the predictability of adaptive genetic changes. Despite many documented cases of convergent evolution at individual loci, little is known about the repeatability of gene family expansions and contractions. To address this void, we examined gene family evolution in the redheaded pine sawfly Neodiprion lecontei, a noneusocial hymenopteran and exemplar of a pine‐specialized lineage evolved from angiosperm‐feeding ancestors. After assembling and annotating a draft genome, we manually annotated multiple gene families with chemosensory, detoxification, or immunity functions before characterizing their genomic distributions and molecular evolution. We find evidence of recent expansions of bitter gustatory receptor, clan 3 cytochrome P450, olfactory receptor, and antimicrobial peptide subfamilies, with strong evidence of positive selection among paralogs in a clade of gustatory receptors possibly involved in the detection of bitter compounds. In contrast, these gene families had little evidence of recent contraction via pseudogenization. Overall, our results are consistent with the hypothesis that in response to novel selection pressures, gene families that mediate ecological interactions may expand and contract predictably. Testing this hypothesis will require the comparative analysis of high‐quality annotation data from phylogenetically and ecologically diverse insect species and functionally diverse gene families. To this end, increasing sampling in under‐sampled hymenopteran lineages and environmentally responsive gene families and standardizing manual annotation methods should be prioritized.


| INTRODUC TI ON
Multigene families are a potentially important source of evolutionary innovation.When gene families grow via duplication, for example, reduced functional constraints may facilitate the development of phenotypic novelty (Demuth & Hahn, 2009;Ohno, 1970).
Reductions in gene family size can also give rise to novel traits.For example, the colonization of highly specialized niches like oligotrophic caves (Gross et al., 2009;Protas et al., 2006;Yang et al., 2016) and toxic host plants (Good et al., 2014;Matsuo et al., 2007;McBride, 2007) is linked to rampant pseudogenization.Together, these observations suggest that some gene families evolve predictably in response to specific selection pressures.Yet compared to the rich and growing literature on genetic convergence at individual loci (Martin & Orgogozo, 2013), the repeatability and predictability of gene family evolution remains understudied (but see Beavan et al., 2023;Thomas et al., 2020).
The evolution of many gene families, defined here as groups of genes that share sequence and functional similarity from common ancestry (Dayhoff, 1976;Demuth & Hahn, 2009), is consistent with a birth-death model where genes arise via duplication (gene gain) and lost via deletion or pseudogenization (gene loss) (Hughes & Nei, 1992;Nei & Rooney, 2005).When duplications and deletions evolve primarily through genetic drift, over time gene family sizes contract and expand via a process dubbed genomic drift (Nei, 2007;Nozawa et al., 2007).Overall, the stochastic birth-death process of genomic drift (which differs from Nei's conceptual birth-death model of gene family evolution; Hahn et al., 2005) sufficiently explains most gene family size distributions within genomes (Karev et al., 2002) and between species (Hahn et al., 2007).
But during an ecological shift, natural selection can influence birth-death dynamics by driving the expansion or contraction of specific gene families.Thus, taxa adapted to a novel niche may have genomic evidence of selective maintenance for gene duplications or deletions.For example, if selection favors the retention of additional gene copies, novel gene duplicates will tend to persist in the genome, increasing the total number of genes.If the mutational mechanism that generates new duplicate genes is unequal crossing over during meiosis, these recently diverged paralogs will be arranged in tandem arrays across the genome (Zhang, 2003).
Moreover, if duplicates experience positive selection for novel functions, they can have elevated amino acid substitution rates.
Conversely, some genetic functions may become obsolete or even deleterious in the novel habitat.In this case, relaxed purifying selection or positive selection will cause some gene families to accumulate loss-of-function mutations at an accelerated rate (Go et al., 2005).After an ecological shift, impacted gene families will eventually reach a new equilibrium state where gene number returns to evolving primarily through negative selection and genomic drift, and pseudogenes fade into the genomic background as they accumulate neutral substitutions (Petrov et al., 1996;Petrov & Hartl, 1997, 1998).Thus, some footprints of adaptive changes in gene family size are likely ephemeral and best detected in lineages that recently shifted to a novel niche.
Arguably, the gene families most likely to evolve in response to niche shifts are those that mediate organismal interactions with their biotic and abiotic environments.These "environmentally responsive genes" include those with chemosensory (e.g., olfactory and gustatory receptors), detoxifying (e.g., cytochrome P450), and immunity (e.g., immunoglobulin and MHC) functions.To cope with constantly changing pressures, environmentally responsive genes tend to be characterized by elevated sequence diversity, duplication rates, substitution rates, and genomic clustering, as well as tissue-or temporal-specific expression (Berenbaum, 2002) and limited pleiotropy (Arguello et al., 2016).Importantly, causal links between changes in environmentally responsive genes and adaptation to novel niches have been established for multiple taxa (Armisén et al., 2018;Després et al., 2007;Dobler et al., 2012;Luo et al., 2020;Matsuo et al., 2007;Sezutsu et al., 2013;Zhen et al., 2012).
With exceptionally diverse ecologies and an ever-increasing availability of annotated genomes (Hotaling et al., 2021;i5K Consortium, 2013;Poelchau et al., 2015), insects are a powerful system for investigating the extent to which environmentally responsive gene families evolve predictably in response to ecological challenges.To date, at least two ecological transitions are hypothesized to have a predictable impact on gene family evolution in insect lineages.In plant-feeding insects, the evolution of increased dietary specialization (i.e., smaller diet breadth) is associated with reduced chemosensory and detoxifying gene family sizes and, for intact genes, elevated rates of nonsynonymous substitutions (Calla et al., 2017;Comeault et al., 2017;Goldman-Huertas et al., 2015;Good et al., 2014;McBride, 2007;McBride & Arguello, 2007) but see (Gardiner et al., 2008).In hymenopteran insects, eusociality is associated with expansions of the olfactory receptor family and contractions of the gustatory receptor family (Brand & Ramírez, 2017;McKenzie et al., 2016;Robertson & Wanner, 2006;Zhou et al., 2015; but see Fischman et al., 2011;Johnson et al., 2018).However, biased sampling in which insect lineages (especially Drosophila and apocritan Hymenoptera) and gene families (especially the olfactory receptor and cytochrome P450 gene families) are studied makes it difficult to draw general conclusions about evolutionary patterns.A better understanding of ecology and gene family size relationships requires a sampling of evolutionarily independent ecological transitions and functionally diverse gene families.To these ends, we characterize multiple environmentally responsive gene families in the genome of the redheaded pine sawfly, Neodiprion lecontei (Order: Hymenoptera; Family: Diprionidae).
Neodiprion is a genus of conifer-feeding sawflies (Order: Hymenoptera; Family: Diprionidae).All species (~50 described to date; Linnen & Smith, 2012;Wallace & Cunningham, 1995) are restricted to host plants in the family Pinaceae; most are found exclusively on plants in the genus Pinus.Because most members of the genus are economically important pine tree pests (Arnett, 1993), the life histories of many Neodiprion species have been studied in great detail, providing extensive information on host use, behavior, morphology, and development (Atwood & Peck, 1943;Coppel & Benjamin, 1965;Knerer & Atwood, 1973).In addition to being well studied, Neodiprion are abundant in nature, can be reared and crossed under laboratory conditions (Knerer, 1984;Kraemer & Coppel, 1983;Linnen et al., 2018), and vary in many ecologically important traits (e.g., host range, larval color, grouping behavior, overwintering mode) (Knerer, 1993;Knerer & Atwood, 1973).In terms of host use, for example, different Neodiprion species specialize on different subsets of pine species with some species gaining the use of certain pine hosts and others losing the ability to use those same hosts (Linnen & Farrell, 2010).Together, these features make Neodiprion an excellent system for uncovering the molecular mechanisms and evolutionary processes that generate phenotypic variation.However, to realize the full potential of this promising model system, genomic resources such as annotated reference genomes are needed.
Beyond the development of a novel model system, a draft genome for N. lecontei contributes a useful data point for comparative genomic analyses in two ways.First, although many assembled and annotated hymenopteran genomes are currently available, almost all are apocritan (ants, bees, and wasps, but see Falk et al., 2022;Michell et al., 2021;Oeyen et al., 2020;Robertson et al., 2018) for some recent exceptions).As the first draft genome from the symphytan family Diprioinidae, N. lecontei increases the ecological, behavioral, and taxonomic diversity of hymenopteran genomes for evaluating ecological correlates of gene family size and other aspects of genome evolution.Second, N. lecontei is an exemplar of an herbivorous insect lineage that underwent a drastic host shift.Sometime within the last 60 million years, the ancestor to extant diprionids transitioned from angiosperms to coniferous host plants in the family Pinaceae (Boevé et al., 2013;Peters et al., 2017).To defend against herbivores and pathogens, Pinaceae produce viscous oleoresin secretions with unique antimicrobial properties (Gershenzon & Dudareva, 2007;Trapp & Croteau, 2001).To manage these toxic and extraordinarily sticky resins, N. lecontei and related diprionids evolved specialized feeding and egg-laying traits (Figure 1).Thus, beyond these traits, we hypothesize that pine specialization likely resulted in strong selection on multiple gene families, especially those involved in chemosensation, detoxification, and immune function.
Here, we describe the N. lecontei draft genome and compare it to other available hymenopteran and pine-specialist genomes.To investigate gene families that may have contributed to pine adaptation, we manually annotated genes from five environmentally responsive gene families: olfactory receptor (OR), gustatory receptor (GR), odorant-binding protein (OBP), cytochrome P450 (CYP), and antimicrobial peptide (AMP).For each gene family, we characterized (1) the number of genes, (2) the proportion of pseudogenized genes, (3) the extent of genomic clustering, (4) evolutionary relationships with hymenopteran orthologs, and (5) patterns of molecular evolution among recent paralogs.Based on these patterns, we identify candidate gene families that may have facilitated a shift from angiosperm feeding to pine-feeding and evaluate how gene family size in Diprionidae compares to other manually annotated hymenopteran genomes.
To measure assembly completeness and artificial sequence duplication, we used CEGMA (Parra et al., 2007) and BUSCO (Simão et al., 2015).Both search the assembly for a set of single-copy, conserved genes; however, the CEGMA software has been deprecated (http://korfl ab.ucdav is.edu/Datas ets/cegma).Of the 248 CEGMA core eukaryotic genes, 90% aligned as complete, single copies and 8% aligned complete but duplicated.For BUSCO, we used the Or-thoDB arthropod dataset, and out of 2675 groups 77% were complete, single copies and 3% were complete but duplicated.These metrics indicate the presence of artificial duplicate sequences, but otherwise the assembly was reasonably complete and suitable for annotation.
About 16% of the assembly consisted of repetitive elements, including 122 unknown transposable elements mostly unique to N. lecontei, and 212 other families of transposable elements and simple repeats (Table A3).This 16% corresponds to 12% of the actual 331-Mb genome, of which we predict 28% is repetitive, suggesting that ~16% of the missing ~28% of the genome is repetitive content (Table A3).Overall, the repetitive element content is consistent with other sawfly genomes assembled from Illumina short reads (Cephus cinctus; Robertson et al., 2018), Orussus abietinus (Oeyen et al., 2020), but see Athalia rosae (Oeyen et al., 2020).

| Olfactory receptors
The OR gene family had 56 genes total, including the co-receptor Orco; one gene contained stop codons and three were partial annotations, leaving 52 genes intact (Table 1).In D. melanogaster, most olfactory sensory neurons (OSNs) express a single OR (along with Orco).Furthermore, OSNs that express a particular OR all converge on the same glomerulus in the antennal lobe (Couto et al., 2005;Gao et al., 2000;Vosshall et al., 2000;but see Fishilevich & Vosshall, 2005), resulting in a general one-to-one anatomy between the number of ORs and the number of glomeruli, correspondence also observed in the hymenopteran European honeybee (Apis mellifera; Robertson & Wanner, 2006).Based on these studies and examination of adult male and adult female N. lecontei antennal lobes, we expected to find a minimum of 49 functional ORs (Figure 2, Table A5).The observed size of the Neodiprion lecontei OR gene family exceeds this minimum, is comparable to other herbivorous sawflies, and is much smaller than that of ants, bees, and many wasps (Figure 3, Table A8).Thus, our data are consistent with the hypothesis that eusocial hymenopterans have unusually large OR families, possibly due to selection stemming from complex chemical communication (LeBoeuf et al., 2013;Robertson & Wanner, 2006;Zhou et al., 2015;but see Brand & Ramírez, 2017;Roux et al., 2014).59% of ORs were in genomic clusters of two or more genes (Figure 4), a low proportion compared to many other hymenopteran OR families (Brand & Ramírez, 2017;McKenzie et al., 2016;Robertson & Wanner, 2006;Zhou et al., 2015).A phylogenetic analysis of OR protein sequences from Neodiprion lecontei, two other sawfly species, three apocritan Hymenoptera, and D. melanogaster identified three Neodiprion-specific clades of at least four genes (Table 2, Figure A1); these were also recovered in a phylogenetic analysis of Neodiprion OR cDNA sequences (Figure A2).All these clades were found in genomic clusters mixed with other OR genes (Figure 4).For the Neodiprion-specific OR clades (and other Neodiprion-specific clades, see below), we used the Neodiprion gene family cDNA tree, the codeml program in the PAML package (Yang, 2007), and likelihood-ratio tests to ask: (1) for the focal OR clade if the ratio of nonsynonymous to synonymous substitution TA B L E 1 Summary of gene family size, genomic clustering, and patterns of molecular evolution for five environmentally responsive gene families.Abbreviations: AMP, antimicrobial peptide genes; CYP, cytochrome P450 genes ("clans" refer to four major clades of CYPs present in insects); GR, gustatory receptor genes; OBP, odorant-binding protein genes; OR, olfactory receptor genes.

Genomic
a Calculated as: (number of genes in clusters of 2 or more)/(genes for which clustering could be evaluated).
b Defined as monophyletic clusters of 4 or more Neodiprion paralogs in an amino acid phylogeny constructed with gene annotations from Neodiprion, select Hymenoptera (including additional sawflies), and Drosophila melanogaster. c To be counted, clades had to reject both 1-ratio and fixed-ratio models in dN/dS branch tests (see Table 2). d To be counted, clades had to reject both M7 and M8a models in dN/dS site tests (see Table 2). e One gene was both a partial annotation and a pseudogene.

| Gustatory receptors
The GR gene family had 44 genes total; two genes contained stop codons, two were partial annotations (one annotation was Horizontal lines indicate the approximate locations of genes within LG; diagonal lines that connect to horizontal lines are used to highlight groups of genes that met our clustering criteria.Genes that were found on scaffolds that have not been placed on linkage groups are indicated on the bottom left, with abbreviated scaffold names given in parentheses (e.g., S-210 = scaffold_210 = LGIB01000210.1 in the assemblies available on NCBI).
TA B L E 2 Likelihood-ratio tests (LRTs) of positive selection on Neodiprion-specific clades (branch models) and on amino acid sites within these clades (site models).both partial and pseudogenized), and 41 were intact (Table 1).In contrast to the OR gene family, the N. lecontei GR family size was larger than other sawflies, and considerably larger than most bees (Figure 3, Table A8); still, the largest hymenopteran GR families are among several ant species.Overall, the patterns of GR family size variation in Hymenoptera do not appear associated with eusociality, as suggested by (Robertson & Wanner, 2006).But given the pronounced variation among taxa, it is possible that other ecological transitions-like changes in diet breadth or specialization on specific niches-have favored GR family expansions or contractions in certain hymenopteran lineages.Additional data (i.e., ecological character states and GR annotations from diverse lineages) are needed to test this hypothesis.
76% of the GRs that could be placed on chromosomes were in genomic clusters (Figure 4) including the three Neodiprion-specific GR clades (Table 2, Figures A3 and A4).GR clade 3 had a discrepancy between the hymenopteran GR protein tree and Neodiprion cDNA tree where NlGR16 was part of the clade in the N. lecontei cDNA tree but not in the hymenopteran amino acid tree; analyses with and without GR16 had similar results.Genomic clustering was evident in all three Neodiprion-specific GR clades; however, some of the genes in clades 2 and 3 were clustered together, suggesting shared ancestry from an older duplication event (also see Figure A3).
Clade 1 had significant values for all four tests of selection (branch and site), clade 2 lacked evidence for positive selection, and clade 3 had evidence of branch and possibly site positive selection (Table 2).
Notably, GR Clade 1 (which had strong evidence for branch and site positive selection) is an expansion of five paralogs orthologous to DmGR33a, a co-receptor required for caffeine detection.In Drosophila, three GRs are known to be required for detecting caffeine (a bitter-tasting deterrent compound): DmGR93a, DmGR66a, DmGR33a (Lee et al., 2009;Moon et al., 2006Moon et al., , 2009)).DmGR93a is a fine-tuned receptor more specific for caffeine while DmGR66a and DmGR33a are broad tuned for a wide range of avoidance compounds (Shim et al., 2015).Because N. lecontei has a paralogous expansion of only the broadly tuned DmGR33a ortholog and not the two other caffeine co-receptors, it may be the case that N. lecontei does not have an increased sensitivity to caffeine but rather an increased sensitivity to a broader range of bitter compounds.

| Odorant-binding protein
The OBP gene family had 13 intact genes total; pseudogenized or partial annotations were not identified (Table 1).This family size is on par with other hymenopterans, but it is important to note that insect OBP annotations are sparse compared to OR and GR annotations (Figure 3, Table A8) so interesting family size variation may be overlooked.It is apparent, however, that except for an unusually large family in Nasonia vitripennis, family size is much less variable across taxa, suggesting that OBP gene family evolution is more constrained (Vieira et al., 2007;Vieira & Rozas, 2011).Unlike the other chemosensory gene families in this study, phylogenetic grouping b Putatively functional genes.Pseudogenes and partial annotations were excluded from analysis.
c Site models unshaded; M7 and M8a do not allow for positive selection.Branch models shaded; 1-ratio estimates a single ω value for all branches, 2-ratio estimates a separate ω value for the foreground branch, 2-ratio neutral fixes ω = 1 for all branches.
e Bolded values are significant at critical value 0.05.
f 2-ratio model values (foreground branch, rest of tree).
g Amino acid sites under positive selection (from M8 Bayes Empirical Bayes analysis p > 95%).
h Reported values are from analyses without NlGR16.

TA B L E 2 (Continued)
failed to identify any Neodiprion-specific OBP clades (Figure A5) and only 38% of OBP genes were in genomic clusters, including a five gene cluster on chromosome 6 (Figure 4) that was not monophyletic (Figure A6).OBP lineage-specific expansions and genomic clustering are common in insects including Hymenoptera (e.g., Forêt & Maleszka, 2006;Jiang et al., 2017;Vieira et al., 2012;Vogt et al., 2015).We note, however, that our OBP phylogenies had low bootstrap support (Figures A5 and A6), making it difficult to infer paralogous relationships.
To compare, the wheat-stem sawfly Cephus cinctus also had a similar gene family size (15 genes) and lack of species-specific clades (Figure A5), but 63% of genes were in genomic clusters (Robertson et al., 2018).Two nonmutually exclusive explanations for the lack of OBP clustering in the N. lecontei genome are: (1) existing genes are in genomic regions with low duplication rates (Langley et al., 2012) and/or (2) OBP gene duplications tend to be removed by purifying selection.

| Cytochrome P450
The CYP gene family had 107 genes total; 12 genes contained stop codons, one was a partial annotation, and 94 were intact (Table 1).The number of intact CYP genes in N. lecontei is higher than many wasp and ant species and considerably higher compared to bee species (Figure 3, https://doi.org/10.5061/dryad.n8pk0p320).Thus, a reduced complement of CYPs may be unique to bees.But without CYP annotations from other sawflies, it is unclear if N. lecontei has an unusually large family size for a hymenopteran.In insects, CYPs belong to four major clades, referred to as clans (Feyereisen, 2012).When split by clan, the CYP2 clan had nine intact genes; the CYP3 clan had 47 intact genes and eight pseudogenes; the CYP4 clan had 28 intact genes, four pseudogenes, and one partial gene, and the mitochondrial CYP clan had 10 intact genes (Table 1, Figures A7 and A8).Across all CYPs, 66% were in genomic clusters (Figure 4).Looking at the four major clans separately, the percentage of clustered genes were: 33% for CYP2, 81% for CYP3, 55% for CYP4, and 50% for mitochondrial CYP (Figure 4).
The CYP gene family had five Neodiprion-specific clades with at least four genes (Figures A7 and A8), four of which were in the CYP3 clan.Of these, CYP clades 3 (CYP6 subfamily) and 5 (CYP336) had evidence of branch-specific, but not site-specific, positive selection (Table 2).Several studies to date suggest that members of the CYP3 clan-and the CYP6 subfamily in particular-play an important role in detoxifying pesticides and host-plant allelochemicals (Feyereisen, 2012).
In clan 2, CYP303A1 is involved in mechano-and chemosensory bristle development (Willingham & Keil, 2004).It is found across winged insects and has a highly conserved length of 498 + 4 amino acids except in Hymenoptera, where orthologs have an insertion that increases the length to 562 amino acids in A. mellifera and 587 in N. vitripennis (Dermauw et al., 2020).The N. lecontei ortholog had 578 amino acids, suggesting that the CYP303A1 insertion is ancestral in Hymenoptera.

| Immunity
As part of the innate immune system, antimicrobial peptides (AMPs) are expressed upon infection to kill or inhibit microbes.Based on hymenopteran sequences, the N. lecontei AMP gene family had 21 genes (Table 1), including single copies of Hymenoptaecin, Abaecin, and Tachystatin, but not a clear Defensin ortholog (Tables A6 and A9).
Eighteen Hisnavicin genes were identified, including a Neodiprionspecific expansion of 10 histidine-rich paralogs orthologous to cuticle proteins in Harpegnathos saltator (ant) and to cuticular protein precursors in Apis mellifera (honeybee) (Figure A9).The N. lecontei Hisnavicins had a conserved 62 amino acid motif that appeared up to 19 times in a single protein (Figure A11); the purpose of this amplification is unknown.95% of the AMPs were in genomic clusters (Figure 4).To date, this is the second largest AMP family documented in Hymenoptera: only Nasonia vitripennis has more AMPs than N. lecontei (44 vs. 21) (Tian et al., 2010).However, relatively few hymenopteran genomes have AMP annotations, making it difficult to draw firm conclusions about AMP family size variation across Hymenoptera (Table A9).Due to the low bootstrap support of many branches in our Hisnavicin protein tree, we could not identify unambiguous Neodiprion-specific clades (Figure A9).However, our Neodiprion cDNA tree (Figure A10) had strong support for a monophyletic cluster of eight Hisnavicins that were part of a larger cluster on linkage group 5 (Figure 4); the eight gene cluster had some evidence of non-neutral evolution; however, the 2-ratio foreground branch ω value was less than the background value (Table 2).
Outside of the AMP family, most immune pathways had direct orthologs with D. melanogaster (Figure A12, Table A7).The basic viral siRNA response pathway was completely conserved between species.The immune deficiency (IMD) pathway was missing an ortholog for the peptidoglycan recognition receptor PGRP-LC, but it is likely that another PGRP replaced PGRP-LC in N. lecontei; assigning PGRP orthology was also difficult in ants (Gupta et al., 2015).
Also missing is the Drosophila mitogen activated protein kinase kinase kinase, TGFβ activated kinase 1 (Tak1), but N. lecontei had a similar TGFβ activated kinase that is a close ortholog to several

| Limitations of assembly and annotation for interspecific analyses
Comparative genomics is a powerful approach for evaluating the repeatability and predictability of evolutionary outcomes; however, the comparisons are only as good as the underlying data.The draft genome presented here was made solely from small-insert libraries with Illumina HiSeq short reads; these types of assemblies are more fragmented with hundreds to thousands of scaffolds (Table A2).
Fragmentation increases the possibility of missing or partial gene annotations since genes may be split across scaffolds, which was the case for the N. lecontei olfactory receptor genes NlOR46joi, NlOR-54joi, NlOR56joi; missing gene annotations would affect our tests for positive selection and interpretations of genomic clustering.In fact, during the writing of this paper a new N. lecontei genome made with sequencing technologies that supplement short reads with long reads (Korlach et al., 2017) and reveals structural information (Peart et al., 2021) became available (Herrig et al., 2023).This provides a future opportunity to revisit these limitations and assess how assembly method and quality affects conclusions about gene family evolution.
Besides genome assembly continuity, gene annotations themselves may be difficult to compare directly.Across studies that included manual annotations, we observed a lack of consistency in the methods and criteria for manually curated gene family datasets.The most problematic inconsistencies were the criteria for delineating intact, partial, and pseudogenized gene annotations."Intact" could mean an exon-by-exon check against closely related orthologs, a minimum amino acid length, or simply the presence of an expected domain.Furthermore, the number of pseudogenized and partial annotations were not always reported or were conflated.This is in addition to variation in the methods used to search for genes.Inconsistency in annotation methods and criteria across studies may introduce taxon-specific biases in gene number unrelated to natural selection.

| The role of gene family evolution in pine specialization
During a niche shift, novel selective pressures may favor the gain or loss of genes within environmentally responsive gene families.For example, pseudogenization and gene loss have been documented in diverse host-specialized taxa and in multiple gene families (Cao et al., 2014;Goldman-Huertas et al., 2015;McBride, 2007;Smadja et al., 2009;Suzuki et al., 2018).Neodiprion lecontei feeds on a single genus of host plants (Pinus) and is an exemplar of a family that shifted from angiosperms to coniferous host plants sometime in the last 60 million years (Peters et al., 2017).When this shift occurred, some genes important for adaptation to angiosperm hosts may have experienced relaxed selection or positive selection for loss-of-function mutations.A priori, one gene family for which we expected evidence of gene loss was the AMP family.Our rationale was that the presence of antimicrobial and fungicidal compounds in pine resin (Cowan, 1999;Gershenzon & Dudareva, 2007;Grayer & Harborne, 1994;Himejima et al., 1992) may have led to relaxed selection on genes involved in immunity.Moreover, because immunity is costly (Sheldon & Verhulst, 1996), selection may have favored a reduced innate immune response in pine feeders.In support of our logic, honeybees (Apis mellifera) exposed to plant resin have reduced expression of immune-related genes (Simone et al., 2009) and wood ants (Formica paralugubris) that use conifer resin as building material have slightly reduced inducible immune system activity and nests with lower bacterial and fungal loads (Castella et al., 2008).In Diptera, AMP loss is associated with herbivorous lineages that live within host tissue, a more sterile habitat than is experienced by most dipterans (Hanson et al., 2019).
Consistent with our prediction that there would be AMP loss in N. lecontei, we were unable to find a clear ortholog of Defensin, an AMP gene present in all dipterans (Hanson et al., 2019) and almost all hymenopterans (Table A9) tested to date.However, the most striking pattern we observed in the AMP family in Neodiprion lecontei was a putative expansion of ~18 Hisnavicin-like AMPs.One interpretation of this lineage-specific proliferation of AMP genes is that a shift to conifers favored gene gain.Although we did not see evidence of positive selection among Hisnavicin-like paralogs (Table 2), it remains possible that novel selection pressures associated with pines-perhaps a community of pathogens unique to pines-favored the retention of Hisnavicin duplicates because they produced a beneficial increase in gene dosage (e.g., Perry et al., 2007).However, additional data are needed to confirm that Hisnavicin orthologs act as AMPs in N. lecontei and to characterize AMPs in closely related symphytan families to verify that this expansion is unique to coniferfeeding sawflies.
Outside of the AMPs, we detected pseudogenes in the OR, GR, and CYP families.However, the prevalence of pseudogenes was modest in comparison to other host-specialized insect taxa.
For example, in the N. lecontei chemosensory protein gene families (ORs, GRs, and OBPs)-the families most often associated with specialization-associated gene loss (Goldman-Huertas et al., 2015;Matsuo et al., 2007;McBride, 2007;McBride & Arguello, 2007)we found that only 0%-5% of genes had clear evidence of a loss-of-function mutation.One explanation is that the pseudogenization events that accompanied adaptation to coniferous plants are no longer detectable in the Neodiprion lecontei genome.For example, in Drosophila, pseudogenes have an estimated half-life of ~14.3 million years (Petrov et al., 1996;Petrov & Hartl, 1997, 1998).Alternatively, the prevalence of host-associated pseudoge-  A3, A7, A9).Many of these closely related paralogous groups are also clustered in the genome, suggesting they originated via tandem duplication (Figure 4).The largest putative expansion was the group of 15 Hisnavicin-like AMPs discussed above.However, we also observed putative expansions in the OR, GR, and CYP families that showed evidence of positive selection (Table 2).
One Neodiprion-specific GR clade with evidence of positive selection-GR clade 1-is an expansion of six paralogs (one is pseudogenized) orthologous to DmG33a, one of three co-receptors required for caffeine detection (Lee et al., 2009;Moon et al., 2006Moon et al., , 2009)).However, orthologs were not found for DmGR93a (Lee et al., 2009) and DmGr66a (Moon et al., 2006).Interestingly, in Drosophila, DmGR93a is a fine-tuned receptor with a higher specificity for caffeine while DmG33a and DmGr66a are more broadly tuned and participate in the detection of other bitter compounds (Shim et al., 2015); bitter compounds are usually interpreted as a deterrent signal (Yarmolinsky et al., 2009).Nevertheless, honeybees, which also lack clear orthologs to these putative co-receptors (Robertson & Wanner, 2006), can detect and even prefer low concentrations of caffeine and nicotine (Singaravelan et al., 2005 but see de Brito Sanchez, 2011).Although pines do not contain caffeine, they synthesize alkaloids that could confer bitterness (Mumm & Hilker, 2006).Thus, despite not having paralogous expansions for all the caffeine receptors, members of GR clade 1 may still be involved in the detection of pine-specific bitter compounds.Duplications of putative bitter GRs are documented in other host-specialized insects, such as Heliconius, Danaus, and Bombyx butterflies and other lepidoptera (Briscoe et al., 2013;Engsontia et al., 2015;Wanner & Robertson, 2008).Together, these observations lend support to the hypothesis that GR bitter receptors are frequently involved in plant-feeding insect host shifts and host specialization.
Because pines contain toxic components like terpenoids and phenolics, detoxifying gene families are also promising candidates for pine adaptation.The mountain pine beetle (Dendroctonus ponderosae), feeds on pine bark and wood and has gene "blooms" (species-specific paralogous gene expansions) in the CYP3 and CYP4 clans (Keeling et al., 2013).Similarly, the CYP family in N. lecontei had five blooms (Table 1, Figure A7): four CYP3 and one CYP4.CYP3 blooms are also found in wood-feeding insects that do not use pine, such as the emerald ash borer (Agrilus planipennis)  2) is from the CYP6 subfamily, linked to host plant adaptation in several insect taxa (Feyereisen, 2012;Li et al., 2003Li et al., , 2007;;Mittapelly et al., 2019).

| SUMMARY AND CON CLUS IONS
The gene families-that is, families other than chemosensory genesare also needed.To maximize signal: noise ratio across diverse taxa and genes, rigorously observed standardized protocols for annotation conventions are sorely needed (Klimke et al., 2011).Together, these data will make it possible to determine the extent to which certain gene families expand and contract predictably in response to ecology.

| Biological material
To minimize the confounding effects heterozygosity has on genome assembly, we sequenced haploid siblings.Like all Hymenoptera, reads: the small-insert libraries each had ¼ of a flow cell lane and the mate-pair library had an entire lane.

| mRNA
The RNeasy Mini extraction kit (Qiagen) was used to collect total RNA from adult female body, adult female head, adult male body, adult male head, eonymph body, feeding larval body, and feeding larval head.RNA from the eonymph head was extracted but not sequenced due to insufficient yield.Each tissue was represented with one replicate that had equal RNA contributions from eight individuals, except for the eonymph body which was comprised of three individuals.RNA integrity and concentration were measured with a 2100 Bioanalyzer (Agilent).
The HudsonAlpha Genomic Services Lab (Huntsville, AL, USA) handled library preparation and sequencing.Nonstranded, barcoded libraries were made for each of the seven tissue samples; on average, mRNA was sheared to 200 bp.The libraries were combined and sequenced on an entire flow cell of Illumina MiSeq with PE250 reads in addition to one lane of Illumina HiSeq 2000 with PE50 reads.
Due to sequencing quality, the 864-bp small-insert reads were filtered to retain reads where at least 70% of the bases had a quality score of 20 or higher (R1) or 60% (R2) (parameters: -q 20 -p 60/70).In situations where only one end of the paired-end reads passed filtering, the passed reads were kept and treated as single-end data.Kmer counting was used to measure read depth before and after filtering (Jellyfish v1.reads were mapped to the assembly scaffolds and scaffolds with a read depth <15 and nucleotide percentage <40 were removed. The completeness of the final assembly was measured with CEGMA (v2.5) (Parra et al., 2007) and BUSCO (v1.22) (Simão et al., 2015) benchmarks.BUSCO was run with the arthropoda-25oct16 database (parameters: --long).

| Genome size estimation
Flow cytometry was described in (Harper et al., 2016).Mbp).To correct for ploidy differences between haploid males and diploid standards, we multiplied the N. lecontei male estimates by 2.
To obtain a single size estimate for each N. lecontei sample, we averaged values obtained for the two standards.
The sequencing reads were mapped to a concatenation of the masked genome and the consensus TE sequences (BWA MEM (parameters: -M; Li & Durbin, 2009).Families that had at least 1× the median coverage to the reference genome for at least 80% of their sequence (to support at least one full insertion found by Repeat-Modeler) and at least 2x the maximum coverage of the reference genome (to support multiple insertions of the family) were extracted with genomeCoverageBed (BEDtools; Quinlan & Hall, 2010).We attempted to identify the consensus sequences with BLASTN and BLASTX (Altschul et al., 1990) searches against a database of repeat elements, but the only hits were to the lineage-specific elements identified by RepeatModeler.Sequences were also filtered for BLAST hits to rRNA or mitochondrial sequences.
We also used dnaPipeTE (Goubert et al., 2015) to identify what proportion of our short reads was composed of repetitive content, we used a random subset of reads corresponding to 1-fold coverage of the genome (331 Mb) and took the total for three separate random samplings of reads (parameters: genome size = 331,000,000 genome coverage = 1 samples number = 3).We then compared this annotation to the RepeatModeler annotation.to Maker (2.09) (Cantarel et al., 2008) as evidence for structural gene prediction.Prior to annotation, the genome was masked using a custom repeat database built using RepeatModeler (v1.0.8) and the annotation was run using the ab initio gene predictors Augustus, Genemark-ES and snap in addition to the evidence provided.The functions of the predicted protein-coding genes were putatively established with BLASTP alignments (Altschul et al., 1990) to the Swiss-Prot database (accessed 20 April 12) (Apweiler et al., 2004).In cases of multiple matches, the top-ranked alignment was assigned to the gene annotation.Protein motifs and functional domains within the annotations were also identified with an InterProScan (v5.3.46.0) (Jones et al., 2014) search against the InterPro database with gene ontology and IPR lookup (Finn et al., 2016).For the official gene set (OGS), the Maker annotations were filtered by hits to the reference databases and/or a minimum eAED score of 0.1.A second set of gene annotations was generated with the NCBI GNOMON pipeline (annotation release 100 on Nlec1.0 assembly, GCF_001263575.1) (Souvorov et al., 2010).
As the genome was annotated prior to submission to NCBI, we encountered a problem when the NCBI contamination software flagged vector/adaptor sequences for removal; this would disrupt the coordinates provided by Maker.We used a modified version of GAG (Geib et al., 2018) that could accept the flagged coordinates from NCBI to edit the assembly and update annotation coordinates accordingly.

Splice Predictor program from the Berkeley Drosophila Genome
Project was used to help identify intron splice sites (http://www.fruit fly.org/seq_tools/ splice.html).New gene models were added to TBLASTN searches and this process continued iteratively until new chemoreceptors were no longer found.The gene models were checked against RNA-Seq reads from tissue-specific transcriptomes (adult antennae, mouthparts, heads, legs, genitalia, and larval heads; Herrig et al., 2021) and against orthologs in the N. pinetum draft genome assembly (NCBI accession GCA_004916985.1).

| Odorant-binding proteins
Custom scripts were used to identify Maker gene annotations (see Section 4.6.1)that contained the classic/6C, Plus-C, Minus-C, or atypical odorant-binding protein (OBP) motif (Xu et al., 2009).These as well as OBPs from Apis mellifera and Nasonia vitripennis were used as queries for TBLASTN (Altschul et al., 1990) searches against the N. lecontei genome; searches did not yield any new OBPs.All genomic regions identified as potential OBPs were manually curated as described for chemoreceptor genes.After manual annotation, duplicate annotations or genes that lacked OBP motifs were removed.

| Cytochrome P450 genes
A broad set of 52 insect CYP genes (covering the diversity of insect CYP families) was searched against the N. lecontei genome assembly (E-value cutoff 1e3).Scaffolds with hits were then searched against 8782 known insect CYPs.The top 10 hits were returned (later increased to 15 to recover more sequences) and filtered for duplicates.
An alternative search of the NCBI GNOMON predictions ("Neodiprion lecontei[orgn] AND P450 NOT reductase") was also performed and new sequences were added to the dataset.This approach found all the loci identified by the initial search, indicating that the GNO-MON annotation tool was able to comprehensively search for CYP sequences.Finally, the candidate N. lecontei CYP sequences were manually curated based on comparison to the best BLAST hits.

| Immune-related genes
Because of the relative completeness of its immune annotation, Drosophila melanogaster immunity genes were used to guide annotation.Reference immune genes from D. melanogaster tagged with the gene ontology term "GO:0002376 -Immune system process" were compiled from Flybase (release 6.13).Orthology with N. lecontei proteins was assigned initially with reciprocal BLASTP (Altschul et al., 1990) searches (E-value cutoff 1e-10).Reference D. melanogaster genes without obvious one-to-one orthologs in N. lecontei were examined individually to determine whether closely related paralogs in one or both species interfered with the inference of orthology.If not, they were searched against the N. lecontei genome assembly using TBLASTN (Altschul et al., 1990) in an attempt to identify unannotated orthologs.
Since antimicrobial peptides (AMP) are unlikely to be conserved between D. melanogaster and N. lecontei, AMPs from three representative hymenopterans Apis mellifera (Danihlík et al., 2015), Nasonia vitripennis (Tian et al., 2010), and Camponotus floridanus (Gupta et al., 2015;Ratzka et al., 2012;Zhang & Zhu, 2012) were used for BLAST queries.Furthermore, since AMP copy number is fast evolving, we attempted to find all the N. lecontei orthologs of each hymenopteran AMP instead of focusing on one-to-one orthology.Once again, BLASTP searches were performed against the annotated proteins and TBLASTN searches were performed against the assembled genome; the TBLASTN search did not reveal additional AMPs.Putative N. lecontei orthologs were reciprocally blasted against the appropriate hymenopteran proteome to assure that the best hits were indeed AMPs.
Amino acid and cDNA sequences for all manual annotated genes are available at Dryad.https://doi.org/10.5061/dryad.n8pk0p320

| Glomeruli segmentation
Whole-brain images of one female and one male were manually segmented using the TrakEM2 software package in ImageJ (Cardona et al., 2012;Schindelin et al., 2012).Individual glomeruli were traced in both brain hemispheres.Glomeruli near the center of the antennal lobe can be difficult to distinguish, meaning counts are biased toward fewer glomeruli and the largest number of glomeruli confidently detected represents a minimum of the number of expected glomeruli.Male Neodiprion have a collection of smaller synaptic clusters in their antennal lobe (Dacks & Nighorn, 2011), but the functional significance of this anatomy is not known.There are more than 50 of these smaller synaptic clusters and we suspect they do not represent the traditional one-to-one OR-to-glomerulus organization.Therefore, these structures were not included in counts.
Male glomeruli number may be lower if particular OSNs contribute to these clusters instead of forming traditional glomeruli.

| Clustering and pseudogene analyses
To evaluate the extent to which members of our five focal gene families were located in tandem arrays, we placed our annotated genes on a linkage-map anchored version of the N. lecontei genome assembly described in (Linnen et al., 2018).We considered genes to be clustered if they were located within a genomic region of 20(n -1) kilobases, where n is the number of genes in the cluster under consideration.This criterion was chosen based on average gene densities in Nasonia (Niehuis et al., 2010) and clustering criteria described Drosophila (Vieira et al., 2007).For scaffolds that could not be placed on linkage groups, we evaluated clustering only if genes were more than 20 kb from either scaffold end.For branch tests, the cDNA phylogenies for each N. lecontei gene family were used to compare the lineage-specific clade to the rest of the gene family.To determine if the foreground branch dN/dS (i.e., the branch with the species-specific expansion) was significantly higher than the background (i.e., the rest of the gene family), for that clade we ran a two-ratio codeml model (Model = 2, fix_omega = 0) and a one-ratio model (Model = 0, fix_omega = 0) and performed a likelihood-ratio test (chi-square distribution = upper tail).To determine if the foreground branch is evolving under selection (dN/dS ≠ 1), we performed a likelihood-ratio test (chi-square distribution = two tail) comparing the two-ratio model to a two-ratio neutral model (Model = 2, fix_omega = 1).Our rationale for using a two-tailed test for neutrality and a one-tailed test for comparing the foreground branch to the rest of the tree is that this would enable us to detect scenarios in which a locus evolving under purifying selection (dN/dS ≠ 1) experiences increased positive selection (or relaxed purifying selection) at some sites upon entry into a novel niche.A PPEN D I X 1 F I G U R E A 1 Select Hymenoptera olfactory receptor gene family tree (amino acid).Included genes were manually annotated, intact (not partial or pseudogene), with a minimum length of 350 AA.

F I G U R E A 2
Neodiprion lecontei olfactory receptor gene family tree (cDNA).Genes were manually annotated and all annotations (intact, pseudogene, and partial) were included.
F I G U R E A 3 Select Hymenoptera gustatory receptor gene family tree (amino acid).Included genes were manually annotated, intact (not partial or pseudogene), with a minimum length of 350 AA.

F I G U R E A 4
Neodiprion lecontei gustatory receptor gene family tree (cDNA).Genes were manually annotated and all annotations (intact, pseudogene, and partial) were included.
F I G U R E A 5 Select Hymenoptera odorant-binding protein gene family tree (amino acid).Included genes were manually annotated, intact (not partial or pseudogene), with a minimum length of 100 AA.

F I G U R E A 6
Neodiprion lecontei odorant-binding protein gene family tree (cDNA).Genes were manually annotated and all annotations (intact, pseudogene, and partial) were included.
F I G U R E A 7 Select Hymenoptera cytochrome P450 gene family tree (amino acid).Included genes were manually annotated, intact (not partial or pseudogene), with a minimum length of 350 AA.
F I G U R E A 8 Neodiprion lecontei cytochrome P450 gene family tree (cDNA).Genes were manually.
F I G U R E A 9 Select Hymenoptera hisnavicin gene family tree (amino acid).Included genes were intact (not partial or pseudogene).The criteria for an intact gene annotation varied across studies.Splice variants were not included.c Not explicitly declared in publication(s).Value = 0 indicates the genes were searched for but not found.Blank cells indicate the gene family was not TA B L E A 9 (Continued)

F
Diprionids transitioned from angiosperms to coniferous host plants and N. lecontei has multiple morphological and behavioral adaptations to Pinus foliage.(a) An egg-laying N. lecontei female demonstrating several adaptations for ovipositing in thick, resinous pine needles, including: a robust saw-like ovipositor (visible within the needle), a tendency to lay many closely spaced eggs per needle, and a tendency to cut resin-draining slits on egg-bearing needles (circled).(b) Throughout development, embryos are in close contact with living host tissue.Prior to hatching, N. lecontei eggs absorb water from the host, causing the eggs to swell and their pockets to open.(c) Earlyinstar larvae have skeletonizing feeding behavior in which only the outer needle tissue is consumed, leaving the resinous interior intact.This strategy prevents small larvae from being overwhelmed by sticky resin.(d) Mid-and late-instar larvae consume the entire pine needle.Larvae sequester pine resin in specialized pouches for use in self-defense (All photos by R.K. Bagley).

f
Low bootstrap support precluded the identification of Neodiprion-specific clades.F I G U R E 2 Optical sections through the antennal lobes of adult female (left) and male (right) N. lecontei.White arrows indicate regions of male-specific synaptic clusters.Scale bars = 500 μm.F I G U R E 3 Number of intact genes in hymenopteran genomes for each of five environmentally responsive gene families.Phylogenetic relationships are as in Moreau et al. (2006), Hedtke et al. (2013), Roux et al. (2014), Brand et al. (2017), Branstetter et al. (2018), Peters et al. (2017
David Nelson, unpublished data)  and the Asian longhorned beetle (Anoplophora glabripennis)(McKenna et al., 2016).Notably, N. lecontei larvae frequently ingest pine bark in addition to pine needles(Wilson, 1991), suggesting that CYP3 may expand predictably in wood feeders.Additionally, one of the two Neodiprion-specific CYP3 clades with possible evidence of positive selection (CYP clade 3) (Table predictability of gene family expansion or contraction in response to specific selection pressures is still an open question.Here, we investigated the evolutionary history of five environmentally responsive gene families in the N. lecontei draft genome, a hymenopteran exemplar of a pine-specialized lineage.Although we saw minimal evidence of recent gene loss via pseudogenization, at least four gene families (OR, GR, CYP, and AMP) had patterns consistent with recent expansions, and three of these families (OR, GR, and CYP) also had possible evidence of positive selection within Neodiprion-specific clades.Based on these data, we hypothesize that these gene families contributed to pine adaptation in diprionids and possibly other host-specialized insects.Testing this hypothesis requires the comparative analysis of high-quality assembly and annotation data from phylogenetically and ecologically diverse insect species.For hymenopterans, increased effort in understudied symphytan, parasitoid, and herbivorous taxa would be especially useful for disentangling different axes of ecological variation contributing to changes in gene family size.For greater insight, annotation data from a greater diversity of environmentally responsive have haplodiploid sex determination in which males (haploid genomes) emerge from unfertilized eggs and females (diploid genomes) from fertilized eggs.A virgin female will bear a clutch of all-male offspring with haploid recombinants of the maternal genome.But the individual genomes are not identical, so an assembly derived from a single clutch is akin to a diploid assembly made from a single individual.All insects were reared in custom, climate-controlled environmental chambers (18:6 light cycle, 22°C, 70% RH) on jack pine (Pinus banksiana) foliage.Our laboratory line of N. lecontei was established from multiple larval colonies collected from a mugo pine (P.mugo) in Lexington, Kentucky, USA (37°59′01.6″N 84°30′38.8″W; population ID: RB017).For the transcriptome, adults and larvae were collected from the first laboratory-reared generation; both were stored at −80°C.For the genome assembly, the founding population was propagated in the lab for two generations, followed by brother-sister matings for an additional two generations.At this point, a single, virgin, adult female (I2G2-V, 4th generation in the lab) was allowed to lay unfertilized eggs onto jack pine seedlings.The offspring (haploid male brothers from an inbred mother) were reared until the eonymph (prepupal) life stage, at which point they were isolated without food for 24 h prior to preservation in absolute ethanol at −20°C.Although eonymphs are nonfeeding, they were starved to minimize residual gut content.4.2 | Sample preparation and sequencing4.2.1 | Genomic DNAWhole eonymph bodies were individually frozen inside microcentrifuge tubes with liquid nitrogen and ground with pestles made from 1-mL micropipette tips; the resulting powder was incubated in cetyltrimethylammonium bromide (CTAB) buffer supplemented with proteinase K and RNase A. After phenol-chloroform-isoamyl alcohol (PCI) extraction and ethanol precipitation, the precipitate was dried overnight before being resuspended in tris-EDTA (TE) buffer.DNA integrity was assessed with 0.7% agarose gel, purity was measured with the 260/280 ratio, and concentration was measured with a Quant-iT dsDNA High-Sensitivity fluorescence assay (Thermo Fisher Scientific).The HudsonAlpha Genomic Services Lab (Huntsville, AL, USA) prepared and sequenced the DNA libraries.Two small-insert, barcoded libraries with average fragment sizes of 337 bp and 864 bp were made from a single individual.A 4.6-kbp mate-pair, barcoded library was made from 25 pooled individuals.All individuals were brothers from the same I2G2-V mother.The libraries were sequenced on Illumina HiSeq 2000 with paired-end, 100 bp (PE100) 1.11)(Marçais & Kingsford, 2011).Finally, reads were screened for sequencing contamination by mapping the reads (BWA v0.7.12-r1039)(Li & Durbin, 2009)  to reference genomes for Escherichia coli (K12 substr.DH10B uid58979), human (v37), loblolly pine (Pinus taeda, v0.8), and Wolbachia (endosymbiont of Dmel uid57851).The genome was assembled with ALLPATHS-LG (v47417) (Gnerre et al., 2011) using default settings, including a minimum scaffold size of 1000 bp.The error-correction module was run on the reads prior to assembly.After assembly, GapFiller (v1.11)(Boetzer & Pirovano, 2012) was used to help close intrascaffold gaps.Spurious scaffolds were identified with SOAP.coverage (v2.7.7)(Li et al., 2009):
For genome size estimation, we used adult males and females from a lab line of N. lecontei established from a colony collected in Auburn, GA (33°59′22.4″N, 83°47′44.6″W; population ID: RB027).Briefly, cell nuclei were collected from the heads of 7 individuals (4 female, 3 male) and stained with propidium iodide.Mean fluorescence for each sample was measured with a BD FACSCalibur flow cytometer (BD Biosciences) and compared to two external standards: Drosophila melanogaster (adult female heads, 1C = 175 Mbp) and Gallus gallus domesticus (CEN singlets from BioSure, Grass Valley, CA, 1C = 1222.5

4. 6 |
Gene and functional annotation 4.6.1 | Automated gene annotation RNA-Seq data for N. lecontei was used to generate training models for gene prediction along with utilization of peptide sequences from other species.PASA (r20130425beta) was used to build a comprehensive transcriptome set from Trinity assembled transcripts along with RNA-Seq read mapping predictions generated from the Tuxedo pipeline.To improve annotation quality, in addition to this N. lecontei transcriptome, annotated proteins from Atta cephalotes (OGSv1.2),Acromyrmex echinatior (OGSv3.8),Apis mellifera (OGSv3.2),Athalia rosae (OGSv1.0),and Nasonia vitripennis (OGSv1.0)were provided adult N. lecontei of both sexes were fixed in 2% paraformaldehyde, 2% glutaraldehyde in phosphate-buffered saline (PBS) for 5 days.Heads were rinsed for 40 minutes three times and the brains dissected out in cold PBS.Following blocking with goat serum, brains were permeabilized with 1% Triton X-100 in PBS (Electron Microscopy Supply, Fort Washington, PA; PBS-TX), rinsed with 0.1% PBS-TX, and incubated on a shaker at 25°C for three nights in primary antibody (1:500 in 2% goat serum in 0.2% PBS-TX).Monoclonal Drosophila synapsin I antibody (SYNORF1, AB_2315426) from the Developmental Studies Hybridoma Bank (catalog 3C11) was used to label synapsin.Subsequently, brains were washed in 0.1% PBS-TX and incubated for two nights in Alexa Fluor 568 (Thermo Fisher) goat antimouse secondary antibody (1:100 in PBS) in the dark at room temperature on a shaker.After secondary incubation, brains were rinsed with distilled water, dehydrated in increasing concentrations of ethanol, and mounted in custom-made aluminum well slides.Brains were cleared by removing ethanol and replacing it with methyl salicylate.Brains were imaged on an inverted Zeiss 880 Laser Scanning Confocal Microscope with a 20X plan-Apochromat 20x 0.8 aperture objective and optically sectioned in the horizontal plane at 3-micron intervals.

4. 7
.4 | Identification of Neodiprion-specific clades and tests of positive selection First, we identified clades unique to N. lecontei.For each gene family, a multispecies, amino acid phylogeny was constructed with manually curated annotations from N. lecontei, select Hymenoptera, and D. melanogaster.Intact sequences were size filtered (350 ≥ for GR, OR, CYP; 100 ≥ for histnavicin and OBP); pseudogenes and partial annotations were excluded.MAFFT alignments (v7.305b) (Katoh et al., 2002) (parameters: --maxiterate 1000 -localpair) were visually inspected to remove sequences with large alignment gaps, and sites with more than 20% gaps were removed with trimAl (v1.4.rev15 build[2013-12-17]) (Capella-Gutiérrez et al., 2009) (parameters: -gapthreshold 0.8).Maximum likelihood phylogenies were made in RAxML (v8.2.4) (Stamatakis, 2014) (parameters: -f a -x 12,345 -p 12345 -# autoMRE) using protein substitution models chosen from ProtTest3 (v3.4.2) (Abascal et al., 2010; Darriba et al., 2011).Neodiprion-specific clades were defined as those with at least four N. lecontei genes (not including partial and pseudogenes) and reasonable bootstrap support (>70%).Second, the clades were confirmed with cDNA phylogenies for each N. lecontei gene family.Amino acid sequences were aligned as above, however, after alignment TranslatorX (Abascal et al., 2010) was used to map cDNA sequences to the amino acid alignment.After trimming, the cDNA alignments were passed to RAxML to construct maximum likelihood gene family trees with the nucleotide substitution model -m GTRGAMMA.Site tests were conducted with codeml (part of the PAML package (PAML v4.9e; Yang, 2007) using the cDNA phylogenies and sequences as inputs.For each Neodiprion-specific clade, the gene family cDNA phylogeny was pruned to remove all branches except for that clade.Codeml models M7, M8, and M8a were fitted to the cDNA sequence and phylogeny data.Likelihood-ratio tests (chisquare distribution = upper tail) were performed for the nested models M7-M8 (null model M7 equally distributes amino acid sites across 10 classes of ω parameter values (p, q) against alternative model M8 that has an 11th class for positively selected sites) and M8-M8a (null model M8a that has 11 classes and does not allow positive selection against alternative model M8).

F
Neodiprion lecontei hisnavicin gene family tree (cDNA).Genes were manually annotated and all annotations (intact, pseudogene, and partial) were included.FI G U R EA 11 Unique N. lecontei Hisnavicin amino acid motif alignment.Residues identical to the consensus are colored.Innate immune pathways overview (based on Drosophila melanogaster).

a
Comparison of select insect antimicrobial peptide gene family sizes (manually curated gene annotation datasets).The criteria for an intact gene annotation varied across studies.Splice variants were not included.bNot explicitly declared in publication(s).Value = 0 indicates the genes were searched for but not found.Blank cells indicate the gene family was not studied.

Clade names a n b Model comparison c LRT statistic d df p-Value e ω (dN/dS) f
Clade names are as in Figures A1 through https://doi.org/10.5061/dryad.n8pk0p320. a (Oeyen et al., 2020).The sawflies Athalia rosae and Orussus abietinus each had one Dorsal ortholog and no Dif orthologs(Oeyen et al., 2020).Since Dif is also missing in the Apocrita(Oeyen et al., 2020), N. lecontei may have a Dorsal duplicate not found in the other sawflies.However, N. lecontei also had single copies of Toll-1 and spaetzle; other Hymenoptera (including A. rosae and O. abietinus) have at least five copies of Toll-1 and two copies of (Ha et al., 2005;Lee etZambon et al., 2005) involved in immune deficiency signaling.The encapsulation/melanization pathway was missing one of the two Drosophila GTPases (Rak2); the N. lecontei Rak1 ortholog may be playing both roles, but again this is likely due to the difficulty of assigning one-to-one orthologs.The Duox pathway was missing the top G-protein coupled receptor, but this is unknown in D. melanogaster and unidentified in other Hymenoptera(Evans et al., 2006).Interestingly, N. lecontei had two copies of Dual Oxidase (Duox), which regulates commensal gut microbiota and infectious microbes(Ha et al., 2005;Lee et al., 2015); Apis mellifera had one copy.Finally, the Toll pathway NF-kappaB transcription factor, Dorsal-related immunity factor (Dif) does not have a one-to-one ortholog in N. lecontei, but two copies of its paralog, spaetzle (TableA7).Since the Toll pathway regulates the expression of some AMPs(Lourenço et al., 2018;Zambon et al., 2005), it is possible that in N. lecontei the Hisnavicin gene expansion is compensating for the loss of Toll-1 and spaetzle.
Comparison of genome assembly and annotation statistics from published hymenopteran draft genomes.
a Assembly and official gene set (OGS) version are v1.0 unless reported otherwise.b Methods used to measure genome size: fc , flow cytometry; k , kmer analysis; fd , Feulgen densitometry; rk , reassociation kinetics; ?, unknown.c Scaffold length with gaps.

TABLE A2 (
Continued) TA B L E A 3 Summary of elements identified in the N. lecontei genome.dnaPipeTE

number of families dnaPipeTE percentage RepeatModeler number of families RepeatModeler percentage (fraction of assembly) RepeatModeler percentage (fraction of total genome) Portion of repetitive genome missing in assembly percentage
Tissue types, read counts, and transcript counts for the N. lecontei transcriptome.Due to a sample preparation error eonymph head mRNA was unavailable.Glomeruli counts from left and right antennal lobes of an adult N. lecontei male and an adult N. lecontei female.
a TA B L E A 6 Summary of N. lecontei AMP orthology with other Hymenoptera.AMP family

Neodiprion lecontei orthologs Nasonia vitripennis orthologs Apis mellifera orthologs Camponontus floridanus orthologs
Reciprocal best Blast hit when more than one sequence is listed.Note that while the best blast hit for nasonin-2 was Nlec_TEN2_1-mRNA-1, this was not a reciprocal best blast hit and Nlec_TEN2_1-mRNA-1 appears to be a teneurin.Immunity gene orthologs between N. lecontei and D. melanogaster.Comparison of olfactory receptor (OR), gustatory receptor (GR), and odorant-binding protein (OBP) gene families (manually curated gene annotation datasets).
b Best blast hit for A. mellifera AMPs vs. N. lecontei proteins (not reciprocal).cdBest blast hit for N. vitripennis AMPs vs. N. lecontei proteins (not reciprocal).TA B L E A 7a May include incomplete, pseudogenized annotations.b