RNA‐Seq‐Pop: Exploiting the sequence in RNA sequencing—A Snakemake workflow reveals patterns of insecticide resistance in the malaria vector Anopheles gambiae

Abstract We provide a reproducible and scalable Snakemake workflow, called RNA‐Seq‐Pop, which provides end‐to‐end analysis of RNA sequencing data sets. The workflow allows the user to perform quality control, perform differential expression analyses and call genomic variants. Additional options include the calculation of allele frequencies of variants of interest, summaries of genetic variation and population structure, and genome‐wide selection scans, together with clear visualizations. RNA‐Seq‐Pop is applicable to any organism, and we demonstrate the utility of the workflow by investigating pyrethroid resistance in selected strains of the major malaria mosquito, Anopheles gambiae. The workflow provides additional modules specifically for An. gambiae, including estimating recent ancestry and determining the karyotype of common chromosomal inversions. The Busia laboratory colony used for selections was collected in Busia, Uganda, in November 2018. We performed a comparative analysis of three groups: a parental G24 Busia strain; its deltamethrin‐selected G28 offspring; and the susceptible reference strain Kisumu. Measures of genetic diversity reveal patterns consistent with that of laboratory colonization and selection, with the parental Busia strain exhibiting the highest nucleotide diversity, followed by the selected Busia offspring, and finally, Kisumu. Differential expression and variant analyses reveal that the selected Busia colony exhibits a number of distinct mechanisms of pyrethroid resistance, including the Vgsc‐995S target‐site mutation, upregulation of SAP genes, P450s and a cluster of carboxylesterases. During deltamethrin selections, the 2La chromosomal inversion rose in frequency (from 33% to 86%), supporting a previous link with pyrethroid resistance. RNA‐Seq‐Pop is hosted at: github.com/sanjaynagi/rna‐seq‐pop. We anticipate that the workflow will provide a useful tool to facilitate reproducible, transcriptomic studies in An. gambiae and other taxa.


| INTRODUC TI ON
Transcriptomics is central to our understanding of how genetic variation influences phenotype (Stark et al., 2019). In recent years, RNAsequencing (RNA-Seq) has replaced microarray technologies for whole-transcriptome profiling, providing a relatively unbiased view of transcript expression (Zhao et al., 2014) with associated higher sensitivity and greater dynamic range (Lowe et al., 2017). The utility of RNA-Seq is exemplified by the vast amounts of data accruing (Van den Berge et al., 2019), and in the many discoveries it has revealed-such as the extent of alternative splicing, and the biology of noncoding RNAs (Stark et al., 2019;Wang et al., 2010;Wang & Burge, 2008).
In recent years, various computational workflows have been developed to analyse RNA-Seq data in a reproducible manner (Lataretu & Hölzer, 2020;Zhang & Jonassen, 2020), but these workflows are designed with the primary aim of differential expression analysis (DEA) and leave a large amount of untapped sequence-based information. A study previously detected population genomic signals in RNA-Seq data, although this specific application remains rare (Thorstensen et al., 2020). In our own area of research, vector genomics, a scan of the literature revealed 33 RNA-Seq studies (Table   S1), of which only five interrogated the sequence data (Bonizzoni et al., 2015;David et al., 2014;Faucon et al., 2017;Kang et al., 2021;Messenger et al., 2021). A barrier to exploiting the full range of information contained within RNA-Seq data sets has been the absence of comprehensive, user-friendly pipelines which permit easily reproducible analysis (Grüning et al., 2018) and enable comparisons across studies.
In this study, using the workflow management system Snakemake (Mölder et al., 2021), we present a reproducible computational workflow, RNA-Seq-Pop, for the analysis of short-read RNA-Seq data sets of any organism. The workflow is applicable to single or paired-end RNA-Seq data, such as those produced on Illumina or MGI (DNB-Seq) instruments. However, we also present modules specifically of interest in the analysis of the major malaria mosquito, Anopheles gambiae s.l., and demonstrate their use in a study of pyrethroid-resistance in a strain of An. gambiae from Busia, Uganda.
Pyrethroids are the most widely used class of insecticide in malaria control, and over the past two decades, resistance in malaria vectors has spread throughout sub-Saharan Africa, posing a threat to vector control efforts (Ranson, 2017). In this period, the incrimination of genes involved in insecticide-resistant phenotypes of An. gambiae has been based primarily on transcriptomic studies. For many years, these were performed using microarrays, synthesis of which has highlighted the repeatable overexpression of a handful of genes involved in detoxification, confirming well-established cytochrome P450s as candidates, whilst also implicating more diverse genes such as ABC transporters and sensory appendage proteins (Müller et al., 2008, Ingham et al., 2018). Yet to date, relatively few diagnostic markers have been identified, and important genes have been missed by standard transcriptomic analyses (Njoroge et al., 2022). These shortcomings illustrate the need for a more comprehensive approach to marker discovery. While whole-genome sequencing is providing valuable information on known and novel resistance variants (Clarkson et al., 2021; The Anopheles gambiae 1000 Genomes Consortium, 2020), exploiting the sequence data within RNA-Seq can help bridge the step from transcriptomics to genomics.
In Uganda, pyrethroid resistance has escalated in recent years (Lynd et al., 2019;Tchouakui et al., 2021). As well as the Vgsc-995S mutation, which has repeatedly been associated with pyrethroid resistance, recent genomic studies from this region have shown that a triple-mutant haplotype, linking a transposable element, a gene duplication (Cyp6aa1) and a nonsynonymous mutation Cyp6p4-I236M, is an important marker of pyrethroid resistance (Njoroge et al., 2022).
A single nucleotide polymorphism (SNP)-array based genome-wide association study (GWAS) also demonstrated the Cyp4J5-L43F mutation to be a useful marker for insecticide resistance, whilst also implicating the 2La inversion karyotype as a potential marker (Weetman et al., 2018). We use RNA-Seq-Pop to uncover patterns of insecticide resistance in Ugandan An. gambiae, monitoring these resistance-associated mutations, whilst performing differential expression analyses, summarizing genetic variation and ancestry, and karyotyping chromosomal inversions.

| RNA-Seq-Pop implementation
We designed the RNA-Seq-Pop workflow according to Snakemake best practices (Köster, 2022). RNA-Seq-Pop is constructed with a single configuration file in human-readable yaml format (the config file), alongside a simple tab-separated text file containing sample metadata (the sample sheet). The overall RNA-Seq-Pop workflow is shown in Figure 1.
Dependencies are internally managed by the package manager conda; to instal all required software, specify the --use-conda directive at the command line, and conda will automatically create isolated software environments in which to run. As of version 1.0.4, RNA-Seq-Pop modules are written in Python (81.2% of the codebase) and R (18.8%), and internally, the workflow utilizes a library (RNASeqPopTools) which provides the infrastructure to the Python codebase, to ensure readability. We provide documentation to guide users on how to set up and run RNA-Seq-Pop: https://sanja ynagi. github.io/rna-seq-pop.

| Quality control
The workflow begins by checking concordance between the userprovided sample metadata, configuration file, and reference and fastq files. Quality control metrics of fastq files are calculated with fastqc (Andrews, 2010), and logs and statistics from eight tools in the workflow are integrated into a report with multiqc (Ewels et al., 2016). Raw fastq reads may be optionally trimmed with cutadapt (Martin, 2011), with the option of specifiying custom adaptor sequences.

| Differential expression
Trimmed reads are aligned to the reference transcriptome with kallisto (Bray et al., 2015) and differential expression performed F I G U R E 1 The RNA-Seq-Pop workflow and example outputs. The workflow has been designed for ease of use, requiring only a configuration file to set up workflow choices and a sample sheet to provide sample metadata. Modules highlighted in green are specific to Anopheles gambiae s.l.
at the gene level with deseq2 (Love et al., 2014) and at the isoform level with sleuth (Pimentel et al., 2017). The gene-level counts are normalized to account for sequencing depth, and principal components analysis (PCA) and Pearson's correlation performed among all samples, and on subsets of the user-selected treatment groups used in differential expression analysis. Plots of these analyses are useful for exploratory data visualization, providing an additional quality control step to ensure expected relationships between samples.
RNA-Seq-Pop combines differential expression results from multiple pairwise comparisons into an excel spreadsheet for the user, as well as generating individual .csv files, volcano plots, and identifying the number of differentially expressed genes at various false discovery rate (FDR)-adjusted p-value thresholds. The user may create venn diagrams for multiple comparisons and generate heatmaps if a list of genes is provided. We use the hypergeometric test for Gene Ontology (GO) term enrichment analysis, on genes that are significantly over-expressed based on FDR-adjusted p-values and, and optionally, the top five percentile of F ST values.

| Variant calling
Reads are aligned to the reference genome with hisat2 (Kim et al., 2019) and read duplicates marked with samblaster (Faust & Hall, 2014) producing binary alignment files (BAM), which are sorted by genomic coordinate and indexed with samtools version 1.19 (Danecek et al., 2021). SNPs are then called with the Bayesian haplotype-based caller freebayes version 1.3.2 (Garrison & Marth, 2012). SNPs are called jointly on all samples, with different treatment groups called as separate populations, at the ploidy level provided by the user in the configuration file. The workflow internally parallelizes freebayes by splitting the genome into small regions, greatly reducing overall computation time. The separated genomic regions are then concatenated with bcftools version 1.19 (Danecek et al., 2021) and the final VCF piped through vcfuniq (Garrison et al., 2021), to filter out any duplicate calls that may occur at the genomic intervals between chunks. Called variants are then annotated using snpeff version 5.0 (Cingolani et al., 2012).

| Variant analysis and selection
RNA-Seq-Pop can then perform analyses on the variants called by freebayes. We apply filters to the data, including restricting to SNPs (excluding indel calls) and applying missingness and quality filters.
We recommend using a quality score of 30 and a missingness proportion of 1, meaning a variant call (reference or alternate allele) must be present in each sample, that is there are no missing allele calls. For each pairwise comparison specified in the config file, the workflow can perform a windowed Hudson's F ST scan (Bhatia et al., 2013;Hudson et al., 1992) along each chromosomal arm, outputting windowed F ST estimates and genome-wide plots. Population branch statistic (PBS) scans may also be performed, conditional on the presence of three suitable populations for the phenotype(s) of interest (Yi et al., 2010). It is also possible to run Hudson's F ST and PBS scans, taking the average for each protein-coding gene, rather than in windows. All population genetic statistics are calculated in scikit-allel version 1.2.1. . We also provide a script (geneScan.py) to interrogate the VCF files, reporting missense variants from any gene of the user's choice. A tab-separated file of variants of interest can be provided, from which the workflow will produce allele frequency heatmaps for each biological replicate and averaged across treatment groups. We define the expressed allele balance as the allele frequency at a genomic location in the aligned read data-for this analysis, RNA-Seq-Pop does not use variants called by freebayes, but instead calculates the proportion of each allele directly in bam files to ease intepretation. An example variant of interest file for Anopheles gambiae is provided in the RNA-Seq-Pop GitHub repository.
All analyses described thus far can be conducted across all eukaryotes of any ploidy, requiring only a reference genome (.fa), transcriptome (.fa) and genome annotation files (.gff3).

| Anopheles gambiae s.l. specific analyses
For An. gambiae s.l. data sets we have exploited the Anopheles gambiae 1000 genomes resource  The Anopheles gambiae 1000 Genomes Consortium, 2020), to incorporate H12 and iHS (Garud et al., 2015) genomic selective sweep analysis. The workflow outputs the differentially expressed gene's genomic location, the specific sweep signals that are present in the Ag1000g resource at that genomic location, and whether the region is a known insecticide resistance-associated locus. We caution that this kind of analysis is exploratory: many genes will be contained within selective sweeps, and may not have a causal link to phenotypic variation.

| Population structure, ancestry and karyotyping
To investigate population structure, we apply SNP quality and missingness filters to the SNP data, which can be configured by the user. Multiple measures of population genetic diversity are estimated for each sample, such as nucleotide diversity (π), Watterson's θ (Watterson, 1975) and inbreeding coefficients. We then prune SNPs in high linkage by excluding variants above an R 2 threshold of .01 in sliding windows of 500 SNPs with a step size of 250 SNPs, and perform a PCA on the remaining SNPs. If the analysed species is An. gambiae, An. coluzzii or An. arabiensis, the pipeline can implement an analysis of putative ancestry informative markers (AIMs). The AIMs were derived from two different data sets.
The An. gambiae/An. coluzzii AIMs derive from the 16 genomes project  and in West Africa may distinguish between individuals with An. gambiae or An. coluzzii ancestry. The An. gambcolu/An. arabiensis AIMs are derived from phase 3 of the Anopheles gambiae 1000 genomes project, and distinguish between individuals with either An. gambiae or An. coluzzi ancestry from An. arabiensis. The relative proportion of ancestry is reported and visualized for the whole genome by chromosome. We modified the program compkaryo  to enable the identification of common inversions on chromosome 2 in pooled samples.

| Mosquito lines
As a case study to the workflow, we sequenced a pyrethroidresistant colony of An. gambiae s.s. from Busia, Uganda, alongside the standard multi-insecticide-susceptible reference strain, Kisumu.
After 24 generations in the colony, we stored RNA from unexposed Busia individuals (G24 Busia parental). Unexposed, age-matched Kisumu females were used as controls. We then selected the remaining Busia G24 colony using 0.05% deltamethrin papers in WHO tube assays for four generations (full details of the selection regime can be found in the Text S2). We exposed females from the selected generation (G28) for 1 h to 0.05% deltamethrin WHO papers using standard protocols, left for 24 h after exposure, and survivors were stored at −80°C prior to RNA extraction (G28 Busia selected survivors). Selections were perfomed following mating.

| Library preparation
We extracted RNA from pools of five, 4-day-old female mosquitoes using a Picopure RNA isolation kit (Arcturus, Applied Biosystems).
We performed six replicates for each Busia-derived treatment group, and four for Kisumu. Library quality and quantity were determined on a Tapestation 2200 (Agilent) using high-sensitivity RNA screentape. Paired-end 150-bp RNA-Seq libraries were prepared and sequenced by Novogene (https://en.novog ene.com/), on an Illumina NovaSeq 6000 system.

| Busia resistance phenotyping
The parental G24 Anopheles gambiae Busia strain had lost much of its pyrethroid resistance during the time in culture and exhibited susceptibility to deltamethrin (100% mortality, 96.3-100 95% confidence intervals [CIs]) and low-prevalence resistance to permethrin (92.6% mortality, 85.6-96.4 95% CIs). Four generations of deltamethrin selections demonstrated this loss to be readily reversible and resulted in a G28 selected Busia strain that showed increased resistance to both deltamethrin (69.7% mortality, 63.2-75.6 95% CIs) and permethrin (21.7% mortality, 14.9-30.5 95% CIs) when exposed for 1 h in WHO tube assays. Further colony information can be found in Table S2. We compared the G24 Busia parental strain with the G28 Busia selected strain, and both Busia strains to the pyrethroidsusceptible reference strain, Kisumu.

| RNA-Seq
As an illustrative example of the modules and output of the RNA-Seq-Pop workflow, we will describe the analysis of the Busia RNA-Seq data set.

| Quality control
We used RNA-Seq-Pop to import FASTQ data files into fastqc (Andrews, 2010) to determine levels of adaptor content, quality scores, sequence duplication levels and GC content in the raw read data. After genome alignment, we applied rseqqc and samtools to collect mapping statistics from the resulting BAM files. We then integrated multiqc into the workflow, which collates statistics and results from eight tools to generate a convenient, interactive (.html) quality control report. Figure 2 shows reports generated by multiqc on the Busia An. gambiae data set.
We removed adapter sequences and aligned paired-end reads to the An. gambiae PEST reference transcriptome (AgamP4.12) ( Figure 2b). In total, 844.25 million reads were processed in total, with 727.84 million successfully aligned, giving an overall alignment rate of 85.58% (±0.206% standard error) across 16 replicates in total. The breakdown of reads counted per sample can be found in Figure S3.
As a further quality control step, and to uncover the overarching relationships of gene expression between samples, RNA-Seq-Pop performs a PCA (Figure 3a), and a sample-to-sample correlation heatmap ( Figure 3b) on the deseq2-normalized read count data. In both analyses, biological replicates of each treatment group clustered together, supporting robust replication in these samples.

| Differential expression
We compared the G24 parental Busia strain to the G28 Busia survivors, and also to the laboratory-susceptible Kisumu, which provides a cross-reference with earlier studies, as well as an extra level of filtering to identify candidate genes. Using an adjusted p-value threshold of .05, our deseq2 differential expression analysis (Wald test) identified 5416 differentially expressed genes between Kisumu and the parental Busia line and 5657 between the parental Busia and the G28 selected Busia survivors. The full table of differentially expressed genes in all comparisons can be found in Data S1, and volcano plots in Figure S4a-c.
We also provide a module which summarizes gene expression in specific gene families if provided with a table mapping genes to pfam domains. We provide an example mapping file. In addition to this module, the user may also supply a list of transcripts, and RNA-Seq-Pop will produce a heatmap on the normalized read count data ( Figure S5).
To investigate the similarity of differential expression compari-   Table S5). A PCA based upon read count data was not qualitatively different from the PCA on expression data (Figure 3 and Figure S8). Table 1 shows genome-wide nucleotide diversity (π) and
To standardize sample size we down-sampled both Busia strains from six to four replicates. Both measures of genetic diversity were significantly lower in the Kisumu strain compared to the two Busia strains, as would be expected after a long history of laboratory colonization. The G28 selected Busia survivors also showed a reduction in genetic diversity compared to the unexposed G24 parental Busia colony.

F I G U R E 3 Exploratory sample clustering. (a) Principal components analysis of the normalized read count data, showing clear separation between conditions. (b)
A sample-to-sample Pearson's correlation heatmap of normalized read counts assigned to each biological replicate; dendrograms show hierarchical clustering applied directly to Pearson's correlations.

F I G U R E 4
Summary of gene expression in gene families of interest. Using PFAM domains, we extract genes from specific gene families, and summarize fold-change (left) and normalized read count data (right). Genes are ordered by hierarchical clustering of the normalized read count data, and clustering results are displayed by the dendrogram (far left). In the fold-change plot (left), green = overexpression in the second group, and purple = underexpression.

| Known insecticide resistance variants of interest
If provided with a list of user-defined variants of interest, RNA-Seq-Pop will generate reports and plots of allele balance (the allele frequency found in the read alignments). For our variants of interest, we curated a selection of SNPs which have been associated with insecticide resistance in previous studies. Figure 6 shows allele frequencies of variants of interest across all samples. We show that over the four generations of selections and after insecticide exposure, the frequency of the Vgsc-995S kdr allele increased from 25% (95% CIs: 21.5%-29.8%) in G24 to 100% in the G28 Busia survivors. In agreement with recent work from the Ag1000g project, we found no known secondary kdr mutations alongside the Vgsc-995S allele (Clarkson et al., 2021).
The Ace-1-G280S mutation was absent from all samples. A single allele of the rdl-A296G mutation was detected in the parental Busia strain. Complete allele balance data for all variants of interest can be found in Data S2. We looked within the primary candidate gene from differential expression analysis, Sap2, for allele frequency changes, but no nonsynonymous variants were present in the data.

| Selection
The workflow permits calculation of F ST and PBS both in windows as genome-wide selection scans (GWSS) and within each gene. In the context of insecticide resistance, finding regions of high genetic differentiation between susceptible and resistant mosquito populations can allow us to identify loci or variants that contribute to the phenotype.
We found high overall levels of F ST between the G24 parental Busia and the G28 selected Busia survivors, but F ST on chromosomal arm 2L was especially elevated as compared to the other arms (Figure 7), with large F ST signals around the Vgsc and 2La inversion. In other data sets from F 1 Figure S13), the genome-wide selection scans are able to capture signals at sites of known selective sweeps.

| Chromosomal inversions
We estimated the karyotype of the samples with compkaryo for the 2La and 2Rb chromosomal inversions, by extracting karyotypetagging SNPs. Karyotyping-tagging SNPs are alleles located within the inversion breakpoints, which show fixed (or almost fixed) differences between the inverted and standard karyotypes. We focus on these two inversions because both contain a large number of tagging SNPs, providing confidence in the overall calls.   Figure S12 shows the per-replicate karyotype frequency.

| Ancestry
Ancestry informative markers are SNPs which show fixed (or almost fixed) differences between species. RNA-Seq-Pop can utilize sets of AIMs to investigate the proportion of ancestry for each chromosome assigned to either An. gambiae, An. coluzzii or An. arabiensis. Figure 8 shows the position of called AIM alleles that map to either An. gambiae or An. coluzzii across the genome. This shows that the Busia samples were primarily of An. gambiae s.s. ancestry across all chromosomes, in concordance with the X chromosome-based SINE species ID assay (Santolamazza et al., 2008). However, the pattern was markedly different for the susceptible reference strain, Kisumu, which showed a large degree of putative An. coluzzii ancestry on the autosomes (Table S11). RNA-Seq-Pop is designed for ease of use, requiring only a sample metadata sheet and a yaml format configuration file. A single command in the terminal will automatically instal all dependencies and run the workflow, which is scaled by Snakemake to run on a personal computer, cluster or cloud environment. The workflow is applicable to single or paired-end RNA-Seq data from any organism, allowing for variation in ploidy, including haploid, diploid or pooled samples.

RNA-Seq
We have written RNA-Seq-Pop in accordance with Snakemake best practices (Köster, 2022), and hope that it is an intuitive program, Typically, we can only detect these inversions through molecular PCR-based assays (of which many do not exist for the range of inversions karyotyped by compkaryo) or laborious and technically challenging cytological experiments (Coluzzi et al., 2002;White et al., 2007), although recent approaches using tagging SNP panels appear promising (Love et al., 2020).
We can also illuminate the putative ancestry of our samples. This is of particular interest as the two recently diverged sibling species An. gambiae and An. coluzzii may often hybridize, and have undergone extensive introgression in the recent past (Fontaine et al., 2015;Vicente et al., 2017), allowing resistance alleles to cross from one species to another (Clarkson et al., 2014;Grau-Bové et al., 2020. Despite this, molecular assays typically target only a single marker on the X chromosome, ignoring the potential for admixture elsewhere in the genome (Caputo et al., 2021;Chabi et al., 2019;Santolamazza et al., 2008). As genome resources advance in other vectors, such as Aedes aegypti and Culex pipiens, we will expand the ancestry-informative marker analysis of RNA-Seq-Pop to these species complexes.

| Patterns of resistance in the Busia data set
Differential expression analysis highlighted a multitude of detoxification genes overexpressed in the selected Busia survivors, including cytochrome P450s, carboxylesterases, chemosensory proteins and ABC transporters, reflecting the broad nature of the response to insecticide exposure. Many P450 genes were ≈2-fold overexpressed and it is not known whether this is due to constitutive differences between the strains, or induction by deltamethrin exposure in the G28 Busia strain. The Sap2 gene in particular was highly overexpressed (10.7-fold after deltamethrin selections and exposure), and thus serves as a strong candidate for pyrethroid resistance outside of the West African An. coluzzii populations in which it was originally identified (Ingham et al., 2020). Sap2 is known to be induced upon insecticide exposure, although the relative importance of selection vs. induction by exposure cannot be determined from this experimental design. The increase in Vgsc-995S kdr allele frequency (previously 1014S) following selections and segregation after exposure is as predicted given its known association with pyrethroid resistance. Interestingly, the G28 selected Busia strain showed a much stronger phenotype against permethrin than deltamethrin (Table S2A) (Table S10).
Interestingly, RNA-Seq-Pop revealed that the Kisumu reference strain exhibits a large proportion of putative An. coluzzii ancestry.
The Kisumu reference strain was colonized from Kisumu, Kenya, in 1975(Williams et al., 2019 from an area where An. coluzzii has not been recorded. The most parsimonious explanation is that the colony has been contaminated through hybridization in the insectary during its long colonization. The X chromosome is typically resistant to introgression, and consistent with a theory of a laboratory contamination event, no An. coluzzii variants are found on the X chromosome. The X chromosome of Kisumu also has a particularly low estimate of Watterson's Θ compared to the autosomes, which may reflect admixture present on the autosomes (Table S7A). In addition, we also find that the Kisumu strain contains the Gste2-114T mutation at high frequency. In agreement with this finding, recent data show occasional low-level resistance to DDT in this strain (Williams et al., 2019). We also observe some putative An. coluzzii alleles in the two Busia strains. Whilst we cannot rule out other explanations, this set of ancestry-informative markers were derived from Mali, and therefore it is likely that some may not be truly informative of ancestry outside of this population.
In this study, we exposed the G28 selected strain in order to maximize the resistance phenotype and strengthen the genotypephenotype association. This design choice, however, may mean estimates of allele and karyotype frequencies are overestimates and not necessarily reflective of either the unexposed G28 or G29 Busia strains. This is because susceptible G28 mosquitoes have not survived, and G28 survivors have probably already mated, meaning susceptible alleles may be passed on to the next generation, affecting allele frequencies. Equally, insecticide survivors may not go on to produce offspring. In general, we recommend sequencing appropriate controls where possible-for example, in our case, including a G28 Busia unexposed group.
When analysing RNA-Seq data we only have read coverage in expressed parts of the genome, primarily in exons, and so we can only call genetic variants in these regions. Although not ideal, given that we expect the majority of functional variants to exist in expressed regions of the genome (Choi et al., 2009), this is not a necessarily a major drawback. Indeed, this is the premise of exome sequencing, in which only protein-coding parts of the genome are targeted to sequence. Additionally, estimated population allele frequencies derived from RNA-Seq data may not accurately reflect DNA-based allele frequencies. Allele-specific expression is one cause of this, where two or more alleles in a diploid or polyploid may be expressed at different levels, causing an imbalance. Despite this, previous studies have shown a strong correlation between expressed and true allele frequencies, particularly at higher sequencing depth (Jehl et al., 2021;Lopez-Maestre et al., 2016;Oikkonen & Lise, 2017;Quinn et al., 2013). In this study, we performed RNASeq at a high sequencing depth, and therefore can have more confidence overall in our genotype calls and subsequent analyses. We recommend generally that for DEAs, lowcoverage RNA-Seq is sufficient (10-25 million reads, or 5-13.5× coverage for An. gambiae), with a higher number of biological replicates (four or more). For variant analyses, higher coverage is preferred (25-60 million reads, or 13.5-32.4× coverage for An. gambiae), with a high number of individuals pooled per replicate (≥10).
Although other studies present strategies to call variants from RNA-Seq data (Brouard & Bissonnette, 2022;Jehl et al., 2021;Piskol et al., 2013;Quinn et al., 2013), none of these studies present convenient, reproducible bioinformatic pipelines to implement their suggested strategies, instead requiring the user to manually perform each step. In addition to that, we found no studies that present pipelines to call variants and also perform analyses on the resulting SNP data. Although a previous study showed that population genomic signals can be extracted from RNA-Seq data, they did not package their analysis into any usable workflow, and the analyses themselves are limited in comparison to RNA-Seq-Pop (Thorstensen et al., 2020).
Based on the lack of comparable, easy-to-use workflows, we envisage that RNA-Seq-Pop will prove useful to many researchers in a range of biological fields.

ACK N O WLE D G E M ENTS
This work was supported by the National Institute of Allergy

CO N FLI C T O F I NTE R E S T
We declare no conflict of interests.

DATA AVA I L A B I L I T Y S TAT E M E N T
The workflow is hosted at https://github.com/sanja ynagi/ rna-