Contrasting gene‐level signatures of selection with reproductive fitness

Abstract Selection leaves signatures in the DNA sequence of genes, with many test statistics devised to detect its action. While these statistics are frequently used to support hypotheses about the adaptive significance of particular genes, the effect these genes have on reproductive fitness is rarely quantified experimentally. Consequently, it is unclear how gene‐level signatures of selection are associated with empirical estimates of gene effect on fitness. Eukaryotic data sets that permit this comparison are very limited. Using the model plant Arabidopsis thaliana, for which these resources are available, we calculated seven gene‐level substitution and polymorphism‐based statistics commonly used to infer selection (dN/dS, NI, DOS, Tajima's D, Fu and Li's D*, Fay and Wu's H, and Zeng's E) and, using knockout lines, compared these to gene‐level estimates of effect on fitness. We found that consistent with expectations, essential genes were more likely to be classified as negatively selected. By contrast, using 379 Arabidopsis genes for which data was available, we found no evidence that genes predicted to be positively selected had a significantly different effect on fitness than genes evolving more neutrally. We discuss these results in the context of the analytic challenges posed by Arabidopsis, one of the only systems in which this study could be conducted, and advocate for examination in additional systems. These results are relevant to the evaluation of genome‐wide studies across species where experimental fitness data is unavailable, as well as highlighting an increasing need for the latter.

molecular basis of adaptation, without requiring prior knowledge of the nature of selection or the phenotype expected to respond.
Most neutrality tests can be classified into two main categories (Biswas & Akey, 2006;Booker et al., 2017;Pavlidis & Alachiotis, 2017;Stephan, 2016). The first category identifies a faster rate of evolution in certain genes or genomic regions when compared to a baseline (estimated by the genome-wide rate of evolution, or the rate of evolution at synonymous sites, which are expected to be neutral or nearly neutral). This category includes the most commonly used method for detecting the signature of selection, dN/dS (Suzuki & Gojobori, 1999), as well as other estimators requiring substitution and polymorphism data such as two variants of the McDonald-Kreitman (MK) test applied in (Stoletzki & Eyre-Walker, 2011): "direction of selection" (DOS) and a "neutrality index" (NI), Haldane's estimator of the logtransformed odds ratio of the MK contingency table (Haldane, 1956).
The second type of test, known as a "frequency spectrum" test, aims to identify a selective sweep: the fixation of a beneficial allele under strong (positive) selection, which is characterised by a relative reduction in variation in comparison to the surrounding regions (Maynard Smith & Haigh, 1974). Tests of this category require only polymorphism data and include Tajima's D (Tajima, 1989), Fu and Li's D* (Fu & Li, 1993), Fay and Wu's H (Fay & Wu, 2000), and Zeng's E (Zeng et al., 2006). An overview of these methods and their interpretation is given in Supporting Information. Both types of approaches to detecting the signature of selection assume that most genes are evolving neutrally (so that those under selection can be differentiated from the baseline), an assumption that has been contested by some (e.g. Kern & Hahn, 2018;Nordborg et al., 2005).
Since every method has limitations, a multipronged approach is recommended (Vitti et al., 2013). However, because there are a limited number of species for which data is available to evaluate multiple types of signature of selection, the choice of method to detect selection is often a pragmatic one, determined by the type of sequence data available for that species and its close relatives.
Methods that only require sequence divergence between species, or polymorphism data between individuals of the same species, are the most common (Nielsen, 2005).
While neutrality test statistics are frequently used to support hypotheses about the adaptive significance of particular genes, what this means phenotypically is often unclear. In the absence of supporting data, caution has previously been raised about the seductiveness of "just so" stories: superficially plausible interpretations of signatures of selection (Smith, 2016). Whole genome scans may rapidly identify (superficially) plausible candidate genes for positive selection without prior knowledge of any associated traits, but the effect these genes have on reproductive fitness is rarely tested experimentally. Generally speaking, there is a growing awareness of the need for fitness data to clarify the specific case for adaptation (Lee-Yaw et al., 2019). Our aim with this study is to bolster this case by directly contrasting seven gene-level signatures of selection with empirical estimates of gene effect on fitness, made using insertion mutation lines.
There are a very limited number of eukaryotic systems in which it is possible to compare multiple signatures of selection and reproductive fitness. For the purpose of this study, we use the model plant Arabidopsis thaliana. While A. thaliana represents short-lived annuals, the sequence and molecular tools available to examine phenotypes in this system are more extensive than any other plant. Sequenced genomes are available both for A. thaliana (The Arabidopsis Genome Initiative, 2000) and its two sister species A. lyrata (Hu et al., 2011) and Arabidopsis halleri (Briskine et al., 2017), which diverged from A. thaliana approx. 5.8 Ma (Kumar et al., 2017) and 5-18 Ma (Honjo & Kudoh, 2019), respectively. In addition, genome-wide resequencing data are available for over 1000 A. thaliana accessions from a range of ecologically diverse habitats Gan et al., 2011; The 1001 Genomes Consortium, 2016). These resources enable genome-wide scans to identify genes under positive or purifying selection using methods based on substitution and polymorphism data. Crucially, a large collection of A. thaliana knock-out (KO) lines in a common background are available (Alonso et al., 2003;O'Malley et al., 2015), as well as quantitative fitness estimates (fruit production) for lines with a major insertion mutation in individual genes, in the form of the unPAK data set (Rutter et al., 2019). The unPAK data set differs from previous estimates of gene effects using mutant lines in other model systems (e.g. Conant & Wagner, 2004;Giaever & Nislow, 2014;Sanson et al., 2018) in that it does not address gene essentiality but whether an insertion mutation in a particular gene increases or decreases fitness measures.
Our results should be viewed as an initial foray into the problem of reconciling neutrality test statistics with direct fitness estimates at single-gene resolution. We discuss our findings in the context of Arabidopsis biology and advocate for examination in additional systems. These results are relevant to the evaluation of genome-wide studies across species where experimental fitness data is unavailable, as well as highlighting an increasing need for the latter.

| Gene orthology annotation
Orthology relationships between A. thaliana and A. lyrata were obtained from Ensembl BioMart (Kinsella et al., 2011). Only genes with a reported one-to-one orthology with ≥75% reciprocal identity were included in analyses. Gene orthologues were further filtered based on whole-gene dN/dS estimates, obtained from Ensembl v92 (Zerbino et al., 2018), to retain only those where 2 > dS > 0.02 and dN <2 (n = 17,824 genes).

| Tests of sequence evolution and selection
We created a data set comprising up to seven measures of sequence evolution per protein-coding gene. Pairwise dN/dS estimates were first calculated for the coding regions of A. thaliana and (where available) its orthologue in both A. lyrata and A. halleri using the PAML package (Yang, 2007). To do so, the longest CDS of each orthologous pair was aligned end-to-end using the Needleman-Wunsch algorithm, as implemented by EMBOSS needle v6.6.0 (Rice et al., 2000) with default parameters. CDS-level dN/dS was estimated from these alignments using the Yang and Nielson model, implemented by the yn00 module of PAML v4.9h (Yang, 2007 To calculate the two measures of sequence evolution that combine allele frequencies with substitutions: the neutrality index, NI (Haldane, 1956), and DOS (Stoletzki & Eyre-Walker, 2011), we used the    (Haldane, 1956), where D n and D s are the numbers of nonsilent and silent substitutions (used to calculate the pairwise dN/ dS), and P n and P s are the numbers of nonsilent and silent polymorphisms. DOS was calculated as D n /(D n + D s )−P n /(P n + P s ), where D n and D s are the numbers of nonsilent and silent substitutions, and P n and P s are the numbers of nonsilent and silent polymorphisms (Stoletzki & Eyre-Walker, 2011).
Using the full set of VCFs from the 1001 Genomes Project, we calculated four additional measures of sequence evolution based on allele frequencies: Tajima's D (Tajima, 1989), Fu and Li's D* (Fu & Li, 1993), Zeng's E (Zeng et al., 2006)

| Gene ontology (GO) term enrichment
To assess whether any GO terms were enriched among the seven

| Testing whether essential genes show signatures of purifying selection
Essential genes should be under stronger purifying selection. We used a list of essential genes in A. thaliana compiled by (Lloyd et al., 2015) to assess whether they had a lower dN/dS than would be expected by chance, using a randomisation test (as in Bush et al., 2014).
There are 591 essential genes with high-confidence dN/dS estimates (out of 705 essential genes in total). Thus, subsets of 591 genes were drawn at random s = 10,000 times from the set of 17,630 genes for which a high-confidence CDS-level dN/dS estimate was available.
We calculated q, the number of times a randomly chosen subset had a lower median dN/dS than the subset of essential genes. Letting H and E could be estimated.

| Gene effect on fitness
To determine the effect of different genes on fitness, we used the unPAK data set (http://arabi dopsi sunpak.org/, last accessed 20 November 2018) (Rutter et al., 2019). unPAK provides estimates of the total number of fruits produced by plants with different genes knocked out, as well as the ancestral wild type (accession COL70000, the background against which all KOs were made).  Table S1.
We found only marginal similarity in gene ranking among most estimators ( Figure 1). For instance, the correlation between dN/dS and Tajima's D, which are among the most widely used estimates, is a negligible Spearman's rho = -0.02. Only two estimators, Fay and Wu's H and Zeng's E, were strongly concordant (rho = -0.98).

| Essential genes show molecular signatures consistent with purifying selection
A list of 705 'essential' genes, whose disruption prevent the completion of the life cycle, have been compiled for A. thaliana by Lloyd et al. (2015). This set of genes should show sequence evolution signatures of purifying selection (Hurst & Smith, 1999;Wilson et al.,1977) and thus provide a good validation for the different methods used. Using a randomisation test, we found that, consistent with stronger purifying selection, the subset of essential genes had a higher median NI than a randomly chosen subset (p = 1 × 10 −4 ; Table 1). These genes also had a lower median dN/dS and lower median DOS (p = .002; Table 1). However, there was no significant enrichment of essential genes within the distributions of the four frequency spectrum estimators (D, D*, H and E). We interpret this as reflecting the high sensitivity of these methods to demographic effects, which suggest they are less appropriate to detect selection in Arabidopsis (see Supporting Information).

| Identifying genes under positive selection
The  Table S1 TA B L E 1 Testing whether "essential genes" show molecular signatures of purifying selection

Measure (no. of Arabidopsis thaliana accessions used in calculation, if applicable)
Median value for subset of essential genes

F I G U R E 2
The distribution of fitness estimates does not significantly differ between genes with signatures of selection (as determined by conventional thresholds) and genes without the signature of selection. Raw data for this figure is available in Table S1. For seven different indices of sequence evolution, we used Kruskal-Wallis tests to assess the null hypothesis that the two sets originate from the same continuous distribution. The null hypothesis was not rejected for any measure:   (Table S1). However, the overlap across methods is poor, and only 29 genes are classed as under selection by all seven methods (Table S1 and Supporting Information). In addition to different methods identifying different genes, we also observe little overlap in enriched Gene Ontology (GO) biological process terms for gene sets identified as under selection by each of the seven indexes (Table S2).

| No association between genelevel measures of sequence evolution and reproductive fitness
We tested whether genes with signatures of selection have higher impact on fitness than genes evolving more neutrally. Fitness was estimated as the number of fruits produced by lines with insertion mutations overlapping specific genes (KO lines) relative to the fruit production of the ancestral line, which lacked them (WT line) (Table   S3). Using conventional thresholds for each of the seven methods (dN/dS > 1, DOS > 0, NI < 0, D < 0, D* < 0, H < 0 and E < 0) we obtained sets of genes with signatures of selection, and genes without, then compared the distribution of fitness estimates between them ( Figure 2). We found no significant differences in the median fitness for either of the seven sets (Kruskal-Wallis p > .05 in all cases). This conclusion was robust to the use of lineage-specific dN/dS values (using T. parvula [Dassanayake et al., 2011] as an outgroup to discriminate between substitutions which occurred on the thaliana or lyrata lineage; Figure S1), to the use of alternative estimates of DOS and NI (using substitution data relative to A. halleri and/or polymorphism data from an independent subset of 19 accessions [Gan et al., 2011]; Figure S2), and when using an outlier approach to classifying a gene as potentially positively selected, rather than conventional thresholds (considering only the top 5% of genes in each distribution; Figure S3). This conclusion was also robust to controls for population structure within the 1001  Figure S4). In addition, the raw fitness estimates adhere closely to a normal distribution (Table S3) so their additional standardisation, as Z-scores (to account for phenotypic variation by growth chamber), does not alter these findings ( Figure S5). Further corroborating this result, we also found no significant correlations between selection estimates and the observed effect on fitness for any of the seven methods (n = 379 genes; Figure 3).
It is possible for KO lines to have more than one insertion site, or for unknown extra insertions to introduce fitness estimate errors.
Thus, we also recalculated the correlations between selection estimates and fitness after excluding lines with an "area ratio" larger than 1.5. "Area ratio" is a comparison of the brightness of the PCR reaction for a single-copy gene compared to the corresponding tDNA F I G U R E 3 Measures of sequence evolution poorly correlate with gene effect on fitness. Contrary to the expectation that knocking out genes with strong evidence of positive or purifying selection will have a higher impact on fitness. This figure shows dN/dS, NI, DOS, D, D* and H estimates for 1852 datapoints (i.e., including all replicates for a given line) representing 379 genes. Zeng's E, which correlates strongly with H (see Supporting Information), is not shown. The upper-right of each panel shows Spearman's rho for the correlations of each estimator with fitness, prior to correction for multiple testing. Although not plotted, a comparably insignificant correlation was found for Zeng's E (rho = 0.03, p = .22). The outlier, with a relative fitness of 8.17, is the Ras-related protein RABC1, a small GTPase insertion, with higher ratios indicating multiple possible insertions (Rutter et al., 2019). After this filtering, we continued to find no significant correlation between sequence evolution and fitness in all seven cases (n = 236 genes; Figure S6).

| DISCUSS ION
In this study, we examined the relationship between various "neutrality tests" commonly used across evolutionary systems to identify genes under selection, and empirical estimates of gene effect on fitness. Previous studies have found an association between genes that are essential and genes that are identified as evolving under pu-

| Fitness effects of genes with signatures of selection
After calculating seven different neutrality tests, we found that neutrality tests are more successful at identifying genes that are essential (for which a knockout will be lethal) than genes that have beneficial effects (those where a knockout will reduce fitness).
This might be because beneficial and deleterious traits have very different mutation profiles, and that neutrality tests fit negative selection better. However, it is difficult to evaluate this hypothesis as both the relative occurrence of different types of mutation, as well as the distribution of their fitness effects, remain poorly understood (Berdan et al., 2021). SNPs (the most common type of mutation) are most often deleterious and purged through negative selection as envisioned by neutrality tests (Keightley & Eyre-Walker, 2010). In contrast, while the most likely path to adaptation remains unknown, mutations other than SNPs are important in the adaptive process (Berdan et al., 2021)-although it is currently much easier to detect single nucleotide substitutions with the present resequencing techniques than insertions, deletions and rearrangements (Ho et al., 2020). More importantly, most neutrality tests are designed to detect positive selection on rare gainof-function alleles that sweep through populations quickly, which might not be the most common process through which adaptations occur. For example, selection on quantitative traits, especially when relying on standing genetic variation, is likely to be operating on multiple combinations of alleles simultaneously, and therefore not show the expected patterns of complete substitution or linkage disequilibria expected with fast fixation. Equally, genes experiencing adaptive loss-of-function are likely to be heterogeneous, violating the core assumptions of traditional neutrality tests (Pennings & Hermisson, 2006).
Insertion mutations are usually assumed to cause loss of function in the genes they occur, and therefore to be deleterious. Thus, it might be surprising to see that twice as many genes for which the mean effect of an insertion mutation on fitness is >1 than <1 (n = 233 and 142 genes, respectively, as detailed in Table S3; note there are four genes where KO effect on fitness is equal to 1).
However, evidence has been accumulating for nonfunctional alleles to significantly contribute to adaptation (Monroe et al., 2021). Also, the fact that gene presence/absence variation is pervasive across plant species (Bayer et al., 2020), including A. thaliana (Bush et al., 2014), support the idea that loss of function can be associated with an increase in fitness. In our study, it is possible that some of the loss of function associated with increased fitness is due to the experiment being carried out under controlled and well-provisioned conditions, but is worth noting that similar results were also observed for insertion-mutation and mutation-accumulation lines grown in field conditions (Rutter et al., 2018(Rutter et al., , 2019.
Our results raise some important questions about the relationship between neutrality tests and their ability to identify genes potentially involved in adaptive evolution. However, because estimating gene-level fitness effects remains challenging given the difficulty in performing functional genomics experiments at large scale, these results must be taken very carefully, as we discuss below. Our inability to detect an association emphasizes the importance of exploring different approaches to empirically validating signatures of selection, and how to do so at a larger scale.

| Different signatures of selection provide discordant results
In general, we found poor agreement between different neutrality tests. Such a pattern has also been reported in humans (Oleksyk et al., 2010) and in the plant Medicago trunculata (Paape et al., 2013), indicating that this may be a general pattern across eukaryotes. The fact that different methods identified different genes as under selection may not be surprising because each method relies on evidence from different evolutionary time scales (Sabeti et al., 2006).
However, there is little clarity about what time period each method explicitly targets and as it is unlikely that any particular neutrality test is specific to a distinct period in time, a degree of overlap should be seen. Thus, while differences in the temporal specificity of different methods can explain some of the discrepancy, it is unlikely to be the main explanation for the very low correlation between methods.
The disagreement between neutrality tests may also partly arise from the fact that some of the methods commonly used are vulnerable to demographic effects, particularly those drawing on frequency spectrum data (such as D, D*, H and E; see Supporting Information and references therein). Thus, it is possible that many of the genes identified as being under selection are false positives, and therefore not replicated across methods. In our data set, the influence of nonselective forces was particularly apparent when estimating Fay and Wu's H, which showed a whole genome distribution skewed towards negative values (Figure 1 and Supporting Information), attributable to inbreeding (Abbott & Gomes, 1989;Agrawal & Hartfield, 2016;Bergelson et al., 1998;Marais et al.,2004;Platt et al., 2010;Schmid et al., 2006;Tyagi et al., 2015).
Despite the general lack of agreement between methods, we were still able to identify 29 genes that are supported as under positive selection by all seven signatures of selection employed here.
Unfortunately, KO lines for these 29 genes were not included in the unPAK data set, so we could not determine fitness impact of these genes. We were also unable to find previous studies that considered the effect of these genes. Future studies specifically examining the possible fitness effects of these genes are needed.

| Important caveats of this work
To the best of our knowledge, A. thaliana is one of the very few species with comprehensive gene-level substitution, polymorphism, and fitness effect data available to perform the tests carried out here.
Nevertheless, this impressive collection of data is still far from an ideal data set to make a final conclusion about the relationship between signatures of selection and beneficial fitness effects, and we must consider three important caveats: Firstly, A. thaliana presents its own analytic challenge as, being a selfer, its set of neutrality test statistics is particularly influenced by demographic effects. This is especially apparent with Fay and Wu's H, which showed a whole genome distribution skewed towards negative values, a well-known consequence of inbreeding (Abbott & Gomes, 1989;Agrawal & Hartfield, 2016;Bergelson et al., 1998;Marais et al., 2004;Platt et al., 2010;Schmid et al., 2006;Tyagi et al., 2015). It is important to stress that our conclusions with regard to the poor agreement between each measure and reproductive fitness nevertheless do not rely on individual measures-accepting that some measures may not give a clear signal in A. thaliana, especially those based on frequency spectra-but apply to every measure tested. We may exclude data from any of the seven measures and still draw the same conclusion.
Secondly, there are still a number of limitations in estimating the fitness impact of individual genes. The lack of agreement between fitness estimates from the unPAK data set and signatures of selection may be due to the fact that a "KO" for a given gene may not necessarily correspond to a line where that gene is not functional, or where the targeted gene is the only one knocked-out. These Arabidopsis mutant lines are created by Ti insertion (Alonso et al., 2003;O'Malley et al., 2015) and depending on where the insertion occurs it may just alter the protein instead of completely preventing its expression. It is also possible that insertions occur in more than one location. In this study, the latter problem has been mitigated by selecting only for lines of A. thaliana in which there are homozygous insertions of Agrobacterium T-DNA into target genes (Alonso et al., 2003;O'Malley et al., 2015;Wang, 2008). However, for a number of genes, it remains uncertain that T-DNA insertion has necessarily resulted in complete silencing (Jupe et al., 2019) as silencing may be sensitive to the insertion position Valentine et al., 2012). In addition, the KO lines used here all were produced in a single genetic background (Col-0), which may itself impact the fitness effect estimates of any gene silenced. It is also possible that fruit count under a controlled environment (as used in the unPAK data set) might not be the closest proxy for fitness. Although we found that fruit production is significantly correlated with biomass, inflorescence height, number of rosette leaves, and rosette diameter ( Figure S7)-collectively suggesting that plants with a higher fruit count are, in general, fitter-we must acknowledge that the estimate of fitness used here was restricted to a single environmental condition, which may not be appropriate to detect fitness consequences associated with a specific environment.
For example, positively selected pathogen-associated genes would not necessarily have shown a decrease in fitness when knocked out under the unPAK conditions, since the necessary pathogen was not present. Similarly, genes associated with an increase in fitness many generations ago may still show a strong molecular signature of selection but no longer affect fitness in the present environment.
Nevertheless, the broad associations described here are not sensitive to specific genes that only function in one environment. It is also not practically possible to exclude individual genes from the analysis on the basis that there is a particular reason why knocking them out no longer matters in this environment, although we cannot rule out that this explanation may suffice for a proportion of them.
Lastly, we must acknowledge that we were limited in terms of sample size to the 379 genes (approximately 1.4% of the total of 27,655 annotated genes) for which fitness estimates were available.
It is unclear whether this is a representative sample, particularly as outliers would be expected in Arabidopsis: a number of genes experience adaptive loss-of-function and so would violate the assumptions underlying the neutrality tests used here (Monroe et al., 2021).
In conclusion, and mindful of the caveats above, we do not find a relationship between gene-level signatures of selection and effect on fitness in our Arabidopsis data set. We approach these results with caution and conclude that the abundance of gene-level neutrality test statistics is far from matched by a complementary set of fitness estimates. We believe our study clearly highlights the need for increased efforts in estimating gene-level fitness using other technologies (such as NILS and CRISPR) to characterize the fitness effects of genes under many environments, both in A. thaliana (an important model system for most plant crops) and other species (as recently advocated by Kerwin et al., 2015;Lee et al., 2014)).

ACK N OWLED G EM ENTS
We thank Laurence Hurst for comments on an earlier version of this manuscript. This study was supported by a PAPPIT-DGAPA-UNAM grant (IA204020) and NERC grant (NE/P004121/1) to AOU, as well as a USDA-REEU and a NSF grant (1052262) to CJM.

CO N FLI C T O F I NTE R E S T
The authors have no conflict of interest to declare.

AUTH O R CO NTR I B UTI O N S
PXK, SB, CM & AUO Conceived and designed the study, contributed data and wrote the paper. SB Performed data analysis. Benefit-sharing: Benefits from this research accrue from the public sharing of our data and code, as described above.