Population genomics and phylogeography of the boll weevil, Anthonomus grandis Boheman (Coleoptera: Curculionidae), in the United States, northern Mexico, and Argentina

Abstract The boll weevil, Anthonomus grandis Boheman (Coleoptera: Curculionidae), is an important pest of commercial cotton across the Americas. In the United States, eradication of this species is complicated by re‐infestations of areas where eradication has been previously successful and by the existence of morphologically similar variants that can confound identification efforts. To date, no study has applied a high‐throughput sequencing approach to better understand the population genetic structure of the boll weevil. Furthermore, only a single study has investigated genetic relationships between populations in North and South America. We used double digest restriction site‐associated DNA sequencing (ddRADseq) to resolve the population genomic structure of the boll weevil in the southern United States, northern Mexico, and Argentina. Additionally, we assembled the first complete mitochondrial genome for this species and generated a preliminary whole genome assembly, both of which were used to improve the identification of informative loci. Downstream analyses revealed two main lineages—one consisting of populations found geographically west of the Sierra Madre Occidental mountain range and the second consisting of populations found to the east—were revealed, and both were sub‐structured. Population geographic structure was consistent with the isolation by distance model, indicating that geogrpahic distance is likely a primary mechanism driving divergence in this species. Boll weevil populations from Argentina were found to be more closely related to the eastern lineage, suggesting a recent colonization of South America by the eastern lineage, but additional sampling across Mexico, Central America and South America is needed to further clarify their origin. Finally, we uncovered an instance of population turnover or replacement, highlighting the temporal instability of population structure.


| INTRODUC TI ON
Accurately describing the population structure and dynamics of species is fundamental to understanding their geographic distributions and evolutionary history. This information is especially important for integrated pest management, which needs to consider pest evolution (Gassman et al., 2009;Pélissié et al., 2018). For widespread pest species, it is critical to understand broad-scale patterns of gene flow because inappropriately informed control strategies can be compromised by source-sink dynamics that could nullify the effects of local suppression (Carrière et al., 2012;Hanski & Gilpin, 1991;Harrison, 1991;Sword et al., 2010;Zaller et al., 2008). Thus, it is important to understand how geography influences gene flow among widespread pest species. In the border regions of the southern United States Curculionidae). In addition to representative specimens from multiple populations across the species' North American distribution, we have also sampled populations from South America and inferred the genetic relationship of these populations to their North American counterparts. We also sampled collection localities temporally to assess the stability of the population genetic structure over time.
The boll weevil is a major economic pest of commercially cultivated upland cotton, Gossypium hirsutum L. (Malvaceae), across the Americas. Native to Mesoamerica, it is generally accepted that the most recent common ancestor of the boll weevil originated in southern Mexico and Central America and diverged from the sister species, Anthonomus hunteri Burke and Cate, during the Pliocene (Alvarado et al., 2017;Burke et al., 1986). The original host plant for the ancestral weevil was probably of the genus Hampea Schltdl. (Malvaceae). The weevil underwent at least one host shift to one or more endemic Gossypium L. species and later shifted to G. hirsutum after its cultivation began in the Americas. The species has since expanded its geographic distribution over time, likely in association with the expansion of the cultivation of commercial cotton. In the late 1800s, the boll weevil underwent a northward range expansion through Mexico and eventually across the entire cotton-producing region of the southern United States (US), where it became an infamous agricultural foe (Burke et al., 1986;Lange et al., 2009). A second range expansion has more recently occurred in South America. First recorded in Venezuela in 1949, the species' range expanded to Colombia in 1951, Brazil in 1983, Paraguay in 1991, Argentina in 1993, and Bolivia in 1997(Scataglini et al., 2006.
By 2016, the boll weevil had spread as far south as the Argentine province of Santiago del Estero and as far west as the province of Salta. In Mexico, Central America, and South America, the boll weevil remains widely regarded as the most important pest of cotton agriculture.
Boll weevil management in North America is complicated by the existence of morphologically and genetically similar variants that can confound diagnostic efforts (Barr et al., 2013;Burke et al., 1986;Burke, 1968;Fye, 1968;Roehrdanz, 2001;Warner, 1966). This, in turn, inhibits rapid response to weevil outbreaks and limits managers' ability to identify unmanaged populations that may act as sources for re-infestations of previously eradicated areas. Further, recent genetic investigations of boll weevil populations have suggested that the current subspecific taxonomy may not accurately describe the reality of the population structure. Classic descriptions of boll weevil variants have generally referred to three forms: (1) the southeastern boll weevil (A. g. grandis); (2) the Thurberia weevil (A. g. thurberiae), which has traditionally been regarded as a host-race associated with Arizona wild cotton, Gossypium thurberi Todaro (Malvaceae); and (3) the Mexican boll weevil, an intermediate form that has never been given any formal subspecies designation (Burke et al., 1986;Cross et al., 1975;Warner, 1966). Under this "three-form" hypothesis, the southeastern boll weevil is described as having a geographic range stretching across the southeastern United States extending south into northern Mexico, east of the Sierra Madre Occidental mountain range, where it overlaps with the Mexican boll weevil variant; the Mexican boll weevil has a U-shaped distribution throughout much of Mexico's lowlands, with only a slight overlap with the very limited range of the Thurberia weevil in southern Arizona and northern Sonora (Supporting Information A). These subspecific denominations are based primarily on morphological characteristics that are notoriously unreliable and may be labile to diet (Barr et al., 2013;Roehrdanz, 2001). Combined with the overlapping ranges of the three forms, these inadequate morphological descriptions have led to inconsistent application of taxonomic status to boll weevil populations throughout the literature and can cause confusion for management.
Recent research has suggested that the boll weevil variant designated as A. g. thurberiae may be divergent due to the hypothesized geographic barrier to gene flow of the Sierra Madre Occidental mountain range, rather than due to any host plant association (Alvarado et al., 2017;Kuester et al., 2012). These studies have opposed the three-form hypothesis altogether, instead proposing a genetic two-form hypothesis wherein the Thurberia K E Y W O R D S boll weevil, ddRADseq, phylogenetics, phylogeography, population genetics, population genomics weevil should be regarded as a uniquely host-associated population of a more widely distributed western genetic lineage whose distribution stretches southward along the western side of the Sierra Madre Occidental mountain range. The second form under this hypothesis consists of populations who are members of an eastern genetic lineage with a distribution east of the Sierra Madre Occidental mountain range and continuing into southern Mexico.
The proposed zone of contact for the two forms is along the southern Pacific coast of Mexico, which is part of the historical range from where the species initially expanded its range northward along the eastern and western fronts (Alvarado et al., 2017;Burke et al., 1986;Kuester et al., 2012).
The primary goal of this study was to elucidate the current population genetic structure of A. grandis across the Americas to better inform management efforts in the United States and northern Mexico. Though a number of studies have investigated the population genetic structure of the boll weevil (Alvarado et al., 2017;Barr et al., 2013;Bartlett, 1981;Kim & Sappington, 2004a, 2004bMartins et al., 2007;Roehrdanz, 2001;Scataglini et al., 2000Scataglini et al., , 2006, none have taken advantage of high-throughput sequencing technology to generate a multi-locus dataset that can provide substantially more resolution than classic population genetic markers. Here, we used double digest restriction site-associated DNA sequencing (ddRADseq, Peterson et al., 2012) to generate a genome-wide dataset of single nucleotide polymorphism (SNP) markers as a means to better understand spatial and temporal patterns of genetic variation in boll weevil population structure. We generated a preliminary reference genome sequence and used it to inform phylogenetic and population genetic approaches to formally test the two-form and three-form hypotheses and resolve the population structure within the resulting lineages. For some populations, we were able to sample in multiple years, allowing for us to not only resolve the spatial structure of populations, but also to evaluate the robustness of that structure over time. Additionally, we evaluated Argentine populations of the species to determine their relationship to the North American lineages. To conclude, we considered the implications of our findings with regard to the current subspecific taxonomy, the international efforts to control the pest populations of the species, and efforts to resolve the population structures of other pest or nonpest species.

| Specimen sampling
Our sampling regime targeted five main geographic regions. These regions included four commercial cotton production areas: northeastern Argentina (ARG), the lower Rio Grande Valley along the United States-Mexico border (RGV), the Chihuahuan Desert ecoregion in Mexico (CMX), and Sonora, Mexico (SMX). The fifth geographic region was southern Arizona where wild cotton, G. thurberi is native (AWC). A total of 292 weevil specimens were collected and processed across 20 spatiotemporally distinct collections (Table 1). Weevil specimens from the four commercial production areas were mainly collected using boll weevil pheromone-baited cone traps (Cross et al., 1969;Cross & Hardee, 1968;Hardee et al., 1971;Tumlinson et al., 1969), whereas those from Arizona were collected directly from G. thurberi plants using a beat bucket technique wherein branches or whole crowns of plants were shaken into a bucket, dislodging adult weevils into the bottom of the bucket.
Insects from all localities were preserved immediately in 95-100% ethanol. Other than during shipping or transportation, all specimens were stored at −80°C until they were removed from storage for DNA isolation. For those collection localities where there were multiple pheromone-baited cone traps, the midpoint GPS coordinates were determined from the GPS coordinates of the traps using the center of gravity method on the geographic midpoint calculator available at www.geomi dpoint.com.

| DNA isolation, library preparation, and double digest RAD sequencing
The Gentra Puregene Cell and Tissue Kit (Qiagen) was used to isolate genomic DNA from whole weevil specimens. Individuals collected in Note: The "Method" column indicates whether the individuals were sampled from pheromone-baited cone traps near commercial G. hirsutum fields or by beat bucket directly from G. thurberi.
The "Abbreviation" column denotes the code used for each collection throughout the paper and includes the geographic region of the collection (ARG was verified for high molecular weight via electrophoresis on a 1.5% agarose gel. Both methods yielded high quality DNA with fragment sizes greater than 10,000 base pairs.

| Sequence quality control, SNP calling, and filtering
TxGen provided demultiplexed raw reads and FastQC version 0.11.3 (Andrews, 2010) reports for all 292 specimens. FastQC reports were summarized and reviewed using MultiQC version 1.7 (Ewels et al., 2016). Potential bacterial contamination was filtered out by using Kraken version 1.1 (Wood & Salzberg, 2014) to match sequences to the nonredundant bacterial database hosted by the National Center for Biotechnology Information (NCBI). Trimmomatic version 0.38 (Bolger et al., 2014) was used to ensure that reads from different Illumina runs had a uniform length. A length selected based on the MultiQC report was achieved via removal of the first 10 bp of each sequence and truncating each sequence at 90 total bp.
Mitochondrial genes can become duplicated and inserted into the nuclear genome, forming a nuclear-mitochondrial DNA sequence (commonly called a "numt" or pseudogene) that may experience different evolutionary pressures than the mitochondrial parent (Bensasson et al., 2001;Grau et al., 2020;Song et al., 2008 A de novo mitochondrial assembly was initiated using the A. g. grandis cytochrome oxidase subunit 1 gene (GenBank accession number: MF636872.1) as a seed sequence. The mitogenome was annotated using the web-based program MITOS version 1.0 (Bernt et al., 2013) and visualized with GenomeVx version 1.0 (Conant & Wolfe, 2008). Next, FastQ Screen version 0.12.1 (Wingett & Andrews, 2018) was used to map ddRADseq reads to the mitogenome and to remove them from the dataset.
After quality control, trimming, and filtering were completed, we used the software pipeline dDocent version 2.6.0 to map the remaining ddRADseq reads to the Dovetail Genomics genome assembly and to identify putative SNP loci (Puritz, Hollenbeck et al., 2014;Puritz, Matz et al., 2014). dDocent was run using default parameters with a match score value = 1, mismatch score = 4, and gap opening penalty = 6. VCFtools version 0.1.16 (Danecek et al., 2011) (Jombart, 2008;Jombart & Ahmed, 2011) was used to create the "genlight" object needed for many of the downstream analyses.

| Phylogenetic and population genetic analyses
To verify that any observed population genetic structure was not due to sequencing batch effects, we utilized a hierarchical analysis of molecular variance (AMOVA, Excoffier et al., 1992) to test for significant factors contributing to variability in the dataset. Our hierarchical levels were sequencing year and geographic collecting locality.
It is also possible that some observed differences between populations could be due to differences in relatedness between individuals within each population. batches = 10, iterations = 500) were also implemented to determine if genetic differences between pairs of collections were statistically significant. To test whether any observed population genetic structure was consistent with an isolation by distance (IBD) model (Rousset, 1997;Wright, 1943), we used option 6 and sub-option 9 of the web implementation of Genepop version 4.2 (Raymond & Rousset, 1995;Rousset, 2008)  R/adegenet was used to carry out a principal component analysis and to further group collections into putative genetic populations using a discriminant analysis of principal components. The number of groups was determined de novo using the K-means clustering algorithm. We tested 1 ≤ K ≤ 21 using all principal components, 100 starting centroids, and 1,000,000,000 iterations per run. The optimal K was selected using the Bayesian information criterion (Supporting Information D). For the discriminant analysis, we used default parameters and retained six principal components and two discriminant functions. R/ggplot2 version 3.2.1 (Wickham, 2016) was used to visualize the spatial clustering of individual genotypes.
The program fastSTRUCTURE version 1.0 (Raj et al., 2014) was used to calculate each sampled individual's probability of assignment to K predetermined genotypic groups where 1 ≤ K ≤ 21. fast-STRUCTURE was run using default parameters. PLINK version 1.07 (Purcell et al., 2007) was used to convert the vcf file into a format that was suitable for input into both programs. The browser-based program StructureSelector (Li & Liu, 2018) was then used to evaluate the fastSTRUCTURE outputs to choose the optimal K value for our dataset using the "LargeKGreedy" algorithm with 2000 repeats.
The optimal K was selected using a maximized marginal likelihood framework (Supporting Information E). CLUMPAK (Kopelman et al., 2015), which is integrated into StructureSelector, was used to visualize individual assignment probabilities.

| Preliminary reference genome size estimation, sequencing, and assembly
The Dovetail Genomics assembly of the boll weevil genome contained 8017 scaffolds spanning 427.92 Mbp. BUSCO results (Table 2) were consistent with a partial genome assembly of 62.86% of our predicted size based on flow cytometry (680.85 Mbp ± 6.68 std. error). The assembly features low fragmentation (scaffold L50/N50 = 8 scaffolds/22.313 Mbp) and high coverage (mean depth 5973X) for the regions of the partial genome that were successfully sequenced. A full report for the final assembly can be found in Supporting Information F.

| Sequence quality control, SNP calling, and filtering
TxGen provided a total of 1.42 TB worth of sequence data across the two ddRADseq runs, and the sequences were generally found to be of high quality as evaluated by FastQC and MultiQC. Kraken determined that an average of 2.64% of the sequences per individual were putatively bacterial in origin, and those sequences were removed. Though the second sequencing run yielded sequences that were 150 bp, the first only yielded 125 bp sequences, and so we choose a Trimmomatic truncation length of 90 that was more appropriate for the first run than the second. This resulted in some loss of data but limited allele dropout due to the different sequencing run lengths.
We successfully reconstructed the first complete mitochondrial genome for A. g. grandis (Supporting Information G). From the initial shotgun sequencing, NOVOPlasty identified 0.35% of the input reads as mitochondrial in origin and assembled 46,138 reads with a 653X average depth of coverage. The assembly consisted of a single, circularized contig with a total sequence length of 17,089 bp and included 22 tRNA genes, 2 rRNA genes, 13 protein-coding genes, and a major noncoding, AT-rich control region. These characteristics are typical of coleopteran mitochondrial genomes (Cameron, 2014;Liu, Bian et al., 2016;Ojo et al., 2016;Sheffield et al., 2008), but the overall length is slightly shorter than has been previously reported in boll weevil (roughly 18,000-19,000 bp, Roehrdanz, 2001). The mitogenome assembly was used to filter potential false-positive SNP loci due to the occurrence of numts in the genome, and we removed at least one ddRADseq locus and a small number of other reads that were present with low coverage.
Our dDocent run identified 116,524 homologous variant loci which were ultimately filtered to 442 SNP loci.

| Phylogenetic and population genetic analyses
The phylogenetic reconstruction ( Figure 1) and the population genetic analyses (Figures 2 and 3)

| D ISCUSS I ON
We have generated the first partial whole genome sequence and the first complete mitochondrial genome assembly for A. grandis.
Our whole genome assembly, which captures roughly two-thirds of the estimated genome size, remains a work in progress and should be considered as such. Nonetheless, the portions of the genome that we have assembled are of high quality and are reliable. Using this assembly as a reference for our SNP calling likely had the effect of reducing the total number of loci detected due to ddRADseq reads not matching any part of the reference. Nonetheless, the reads and reference qualities are high, and we used very stringent filters, so the loci we did obtain for subsequent analyses are not compromised due to the partial assembly. Our mitogenome assembly was also somewhat shorter than expected. Nonetheless, the assembly has high completeness, contiguousness, and coverage depth. So, this discrepancy in length is likely attributable to the assembly software underestimating the number of repeats in the AT-rich control region. The position of the tRNA-isoleucine (trnI) has undergone a rearrangement into the middle of the control region, differing from the ancestral arrangement found in most insects. Similar rearrangements have been found in Sitophilus oryzae and S. zeamais (Ojo et al., 2016), and complete losses of the trnI have been documented in other weevils (Liu, Gao et al., 2016;Nan et al., 2016;Song et al., 2010;Tang et al., 2017). Further investigation of these rearrangements and losses in the Curculionoidea may be warranted, as they may be taxonomically informative for higher level phylogenetics.
Overall, the results of our phylogenetics and populations genetics analyses indicated that the sampled individuals represented two main boll weevil lineages and that those lineages were highly F I G U R E 4 Pairwise genetic distances plotted as a function of pairwise geographic distances. Mantel test for isolation by distance yielded Pr(correlation > observed correlation) = 0.00000 under null hypothesis. Inset shows the equation and R 2 value for the linear regression sub-structured. The genetic structure was intimately tied to the geography of the populations. There was strong support for IBD as a mechanism for divergence. We also found evidence of temporal instability, suggesting that the population structure could be labile to time.

| Revisiting the two-form and three-form hypotheses with implications for taxonomy
Two opposing hypotheses of boll weevil variation have been de-  (Alvarado et al., 2017;Scataglini et al., 2000), such levels are not typically expected within a single species. Nonetheless, our F ST values are relatively consistent with our other analyses, and other types of genetic markers have yielded similarly high estimates (Alvarado et al., 2017;Scataglini et al., 2000), so it remains possible that boll weevil populations are highly divergent in a way that warrants further exploration. Regardless, there is little support for the Thurberia weevil, A. g. thurberiae, to warrant subspecific taxonomic status. Though the authors recognize the unique association of some populations in the western lineage with G. thurberi, those populations, in our study, were not more distinct from populations infesting commercial cotton in Sonora than populations in the lower Rio Grande Valley were from those in the Chihuahuan Desert. If those G. thurberi populations warrant subspecific status, then so should many of the other sampled populations.
Despite the host association in those populations, their genetic divergence from other populations could just as easily be explained by IBD (Figure 4). It is also critical to consider that other populations of boll weevils that we did not sample may also utilize alternative host plants. In addition to other species of Gossypium, boll weevil has been documented in association with other Malvaceae including Hampea spp., Cienfuegosia spp., and Thespia populnea (Burke et al., 1986;Cross, 1973;Cross et al., 1975).

| On the origin of the South American range expansion
Consistent with Scataglini et al. (2006), we found that Argentine boll weevils were more closely related to, and likely derived from, the

| Considerations for boll weevil management
Our results, based largely on weevils from commercial cotton, showed that there was a large amount of genetic diversity across geographic sampling regions. However, within geographic sampling regions, diversity was much lower, suggesting that in a given region, there is typically a single, contiguous population (Figures 2 and 3).
Management and ongoing eradication efforts must recognize that when an area is dominated by such a population, control efforts must be implemented at the area-wide scale, as local management efforts will be ineffective if the population can re-establish from nearby source populations with the same genetic profile. This is especially important when considering the role of alternative host plants as a reservoir host for boll weevil. In the western lineage, weevils infesting G. thurberi in Arizona were not so different from those infesting commercial cotton in Sonora (Figures 1 and 2), and there was evidence of introgression of the AWC genetic group into the SMX collection in 2017. This suggests that, at least, these populations are geographically close enough to interbreed under the right conditions, and that, at worst, could directly provide migrants with the ability to infest commercial cotton. While management cannot realistically control populations in natural areas, this insight should be considered when developing trapping schemes for monitoring. In the lower Rio Grande Valley, managers must acknowledge the contiguity of boll weevil populations along the United States-Mexico border and that effective management will require a coordinated international effort to successfully combat this pest.
More broadly and perhaps most critically, managers must recognize the reality of the temporal instability of pest population genetic structure (Choi et al., 2011;Dinsdale et al., 2012;Lainhart et al., 2015;Raszick et al., 2020). In this study, we discovered a case of population replacement or turnover in the RGV-Tex sampling locality from 2014 to 2016 (Figures 2 and 3 (Barr et al., 2013;Kim et al., , 2008. It is imperative that future studies consider the possibility that population structure can be influenced by yearly changes in habitat quality or by management and eradication programs themselves, and studies may need to be repeated to monitor for potential shifts in the population structure.

| Implications for population genetic studies and management of other pest species
Temporal instability of population genetic structure is not a novel discovery. Studies of economically or ecologically important fish species have long monitored changes in allele frequencies over time (Garant et al., 2000;Glover et al., 2012;Karlsson & Mork, 2005;Skaala et al., 2006), and the phenomenon has not been ignored in entomological research. Local genetic turnovers similar to the one observed in this study have been previously documented on comparable time scales in boll weevil populations in parts of Texas and Mexico using microsatellites (Choi et al., 2011), as well as in other insect pest species (Dinsdale et al., 2012;Lainhart et al., 2015;Raszick et al., 2020). Temporal instability, regardless of whether it is due to a natural process or due to management, must be considered for long-term effectiveness of area-wide management. As populations change, so too may the patterns of migration and divergence. This may be critical for addressing source-sink dynamics in cropping systems or when monitoring for resistant populations. Even for nonpest species, regular monitoring of the population genetic structure over time could provide insights for conservation or for the effects of habitat loss or climate change. At the very least, population genetics studies should recognize that population structure is dynamic, so sampling only a single point in time is akin to taking a snapshot.
When possible, populations should be resampled over multiple time points, and the design of the study should allow for additional data to be incorporated as future populations are resampled, enabling comparisons of structure over time.

ACK N OWLED G EM ENTS
This research was carried out as part of Raszick's PhD dissertation, and he would like to thank his committee members Hojun Song, Aaron Tarone, and Giri Athrey, for their unflagging support and considerable insight. All authors would like to thank Richard Metz of Texas A&M AgriLife Genomics and Bioinformatics Service for logistic and technical support. We would also like to thank Jason Wulff, formerly of Texas A&M University, for his assistance with bioinformatics troubleshooting. Isaac Esquivel and Jacki Kuzio of Texas A&M University are thanked for contributing GIS expertise to the maps used for Figure 3, and Norman Barr is thanked for taking the time to review the manuscript and provide comments.
We are grateful to the reviewers who donated their time and effort to a thorough and meaningful review of this manuscript. This Research Computing. Colorblind-friendly figures were created using the palettes provided at https://perso nal.sron.nl/~pault/.

CO N FLI C T O F I NTE R E S T
The authors have no conflict of interest to declare.

AUTH O R CO NTR I B UTI O N S
TJR carried out the majority of the analyses and wrote the manuscript. CMD provided custom scripts and generally supported the bioinformatics used for the population genetic analyses. LCP conducted the mitochondrial genome assembly and annotation and the relatedness analysis. AET contributed to the bioinformatics and downstream analyses in R. CPCS, RRA, TNB, and MRF coordinated and supported sample acquisition and contributed intellectually. JSS supported the genome size estimation. GAS provided resources and laboratory space for the study, collected samples, helped write the manuscript, and contributed intellectually. All authors were given opportunity to review and contribute to the manuscript before submission into consideration for publication.

DATA AVA I L A B I L I T Y S TAT E M E N T
The full mitochondrial genome sequence for A. g. grandis used in this study has been made available through the NCBI GenBank under accession number MT319132, and raw reads generated by ddRADseq have been made available through the NCBI Sequence Read Archive (SRA) under BioProject ID PRJNA623635. The preliminary whole genome assembly that was ultimately used for ddRADseq read mapping can be made available upon request to the authors. Unix and R scripts used for the quality control and analysis of ddRADseq data can be found on GitHub: https://github.com/tjras zick/boll_weevil_pop_gen.