The environment drives microbial trait variability in aquatic habitats

A prerequisite to improve the predictability of microbial community dynamics is to understand the mechanisms of microbial assembly. To study factors that contribute to microbial community assembly, we examined the temporal dynamics of genes in five aquatic metagenome time‐series, originating from marine offshore or coastal sites and one lake. With this trait‐based approach we expected to find gene‐specific patterns of temporal allele variability that depended on the seasonal metacommunity size of carrier‐taxa and the variability of the milieu and the substrates to which the resulting proteins were exposed. In more detail, we hypothesized that a larger seasonal metacommunity size would result in increased temporal variability of functional units (i.e., gene alleles), as shown previously for taxonomic units. We further hypothesized that multicopy genes would feature higher temporal variability than single‐copy genes, as gene multiplication can result from high variability in substrate quality and quantity. Finally, we hypothesized that direct exposure of proteins to the extracellular environment would result in increased temporal variability of the respective gene compared to intracellular proteins that are less exposed to environmental fluctuations. The first two hypotheses were confirmed in all data sets, while significant effects of the subcellular location of gene products was only seen in three of the five time‐series. The gene with the highest allele variability throughout all data sets was an iron transporter, also representing a target for phage infection. Previous work has emphasized the role of phage–prokaryote interactions as a major driver of microbial diversity. Our finding therefore points to a potentially important role of iron transporter‐mediated phage infections for the assembly and maintenance of diversity in aquatic prokaryotes.


| INTRODUC TI ON
For untangling the different factors that contribute to the temporal dynamics of communities, we first need to understand the mechanisms that underlie biological system variability. Identifying the factors that determine the assembly and maintenance of diversity in microbial communities over time is a central tenet in microbial ecology, and is a prerequisite to robustly predict the responses of communities to a dynamic environment (Nemergut et al., 2013).
Previous studies have demonstrated that temporal dynamics in the diversity of microorganisms is similar to that of macroorganisms (Oliver et al., 2012;White et al., 2006). Due to their huge population sizes, short generation times and relative ease of sampling, microorganisms can serve as valuable models to understand general mechanisms that drive temporal variability and changes in community diversity. For instance, it has been shown that microbial communities that switch from deterministic and niche-based, to more stochastic assembly mechanisms experience gradually higher rates of species turnover (Ayarza & Erijman, 2011;van der Gast et al., 2008).
Stochastic-driven community assembly events have therefore been argued to be responsible for high species turnover rates in communities with large metacommunity sizes (Ayarza & Erijman, 2011).
On the other hand, increased species turnover rates and temporal variability in species diversity and composition have also been linked to high levels of disturbance or environmental heterogeneity (Ager et al., 2010;Guo et al., 2019;Shade et al., 2013). These and other studies (Hatosy et al., 2013;Liang et al., 2015;Yan et al., 2012) have explored the temporal turnover or variability of taxonomic microbial units. However, while species are the natural evolutionary units that carry traits, they are selected for and assembled in a community based on their position in the niche space.
The importance of trait-based ecology for an improved understanding of the patterns and mechanisms that broadly determine the performance of complex communities has previously been highlighted (Escalas et al., 2019;McGill et al., 2006). Prokaryotes are furthermore characterized by large subspecies-level genomic plasticity due to genetic rearrangements that are often mediated by mobile genomic islands such as integrons, transposons, integrative and conjugative elements, and prophages (Bertelli et al., 2019). Such genetic rearrangements include horizontal gene transfer (HGT) via transformation, conjugation or transduction, gene loss or gene duplication, and blur the link between taxonomic identity and functional capabilities (Boon et al., 2014;Cordero & Polz, 2014). The phenomenon of prokaryotic subspecies-level genomic plasticity is addressed in the pan-genome concept, according to which at the species level 20%-35% of genes in a genome can be specific for a certain strain (Medini et al., 2005). One pertinent example are members of the aquatic SAR11 bacterial clade, which is among the most abundant organisms on Earth (Morris et al., 2002) and exhibits a large subspecies-level diversity (Ward et al., 2017). However, this subspecies-level diversity is not well resolved by the 16S rRNA gene commonly used as a phylogenetic marker: for example, genomes from the Pelagibacter strains HTCC1062 and HTCC1002 that were isolated from the same sample have been shown to differ by 31 genes inserted in the HVR3 region of HTCC1002, but only a single base in their 16S rRNA gene Wilhelm et al., 2007). Accordingly, the assessment of temporal dynamics in microbial communities based on phylogenetic marker genes may underestimate the true turnover of functional attributes in a community.
The importance of allelic variation of the same gene has been considered to be of inferior importance compared to the importance of differences in the gene content for determining physiological differences of prokaryotic cells among closely related lineages (Medini et al., 2005(Medini et al., , 2008Whitaker & Banfield, 2006). However, at the community level, allelic variations of the same gene orthologue arise often from (possibly distantly related) organisms that share the trait encoded by the shared gene orthologue, while otherwise occupying different niches in the environment. Allele diversity of gene orthologues in complex microbial communities can therefore be interpreted as a measure of their functional redundancy (Louca et al., 2018). Functional redundancy is considered to be a key parameter in community ecology, because it can buffer the functional performance of communities in a fluctuating environment (Jurburg & Salles, 2015).
In this study we address the temporal variability (betadiversity) of individual traits by assessing the variability of alleles representing individual gene orthologues in aquatic microbial communities.
To avoid manual and possibly ambiguous classification of gene orthologue groups into complex traits, we used individual gene orthologues as a proxy for similar traits, as has been done previously, for example, to estimate functional diversity (Louca et al., 2018;Raes et al., 2011). We propose that this is biologically meaningful as the presence or absence of a gene in a particular genome should influence the performance of a measurable trait in the respective organism.
Increasing metacommunity size has previously been associated with higher taxon turnover because of an increasing importance of stochastic effects in the assembly of temporally linked communities transporter-mediated phage infections for the assembly and maintenance of diversity in aquatic prokaryotes.

K E Y W O R D S
community ecology, ecological genetics, marine environments, metagenomics, microbial ecology, time-series (Ayarza & Erijman, 2011). We therefore also hypothesized (i) that functional units would be more variable over time when they originate from a large seasonal metacommunity.
Beyond the impact of stochastic events on community assembly, previous work suggests that species sorting governed by prevailing environmental conditions is the most important factor for community assembly in aquatic bacterial communities (Langenheder & Lindström, 2019). We therefore propose that with increasing heterogeneity of the environmental parameters relevant for the functionality of a certain gene orthologue, the temporal allele variability of this specific gene will increase. For this, we consider environmental heterogeneity in its broadest sense, including the variability of exogenous parameters such as salinity or temperature, but also the variability created by biotic activities: these include for instance changing chlorophyll a concentration but also more elusive parameters such as the concentration of vitamins, certain metabolites or the presence of phages. To address the relationship between environmental heterogeneity and the variability of functional units, we hypothesized (ii) that genes that are present in multiple copies within a single genome will feature higher allele variability than single-copy genes. The rationale for this is that small differences between such copies within a single genome will enable the organism to perform well in a fluctuating environment as alleles/copies with different performance optima can be engaged. One example is the genes coding for chitinases, which can appear in up to 10 copies within individual genomes (Karlsson & Stenlid, 2009). The alleles representing these copies can differ in pH optima (Miyashita et al., 1991) or in substrate versatility with regard to the crystalline form of chitin that is processed, a feature that differs in chitin-containing organisms (Svitil et al., 1997). In agreement with the idea of better performance of multicopy genes in a changing environment, the presence of multiple copies should be particularly beneficial for proteins encountering fluctuations of parameters that are relevant for their functionality.
We finally hypothesized (iii) that genes encoding products that are exposed to the extracellular milieu would feature higher temporal variability than gene products present in the cytoplasm, which are buffered against changing environmental conditions. Also here, chitinase genes can serve as an example, and it has indeed been suggested in a time-series study that compositional changes in chitinases, as typically multicopy extracellular genes, are highly dynamic over the annual cycle (Beier et al., 2012).
We have tested the above-mentioned hypotheses with data from five surface aquatic metagenomic time-series, assuming that allele variability increases with (a) larger seasonal metacommunity size, (b) higher gene copy number and (c) exposure to the extracellular environment. We estimated the temporal mean allele variability of genes that are not essential in prokaryotes and refer to them as auxiliary genes in the remainder of this paper. These genes reflect traits that are potentially, but not necessarily, present across prokaryotes in general. We additionally estimated the mean allele variability of a set of obligatory single-copy genes, considered as phylogenetic markers, in order to estimate and compare overall trait and species variability in each of the five data sets. The analysed data comprised surface ocean time-series from oligotrophic offshore sites from the Hawaii Ocean Time-series in the Pacific (HOT) and Bermuda Atlantic Time-series Study in the Atlantic (BATS) (Biller et al., 2018). We also included time-series from two marine sites closer to the coast, including the mesotrophic Linnaeus Microbial Observatory (LMO) in the Baltic Sea Proper (Alneberg et al., 2018;Hugerth et al., 2015) and the oligotrophic SOMLIT Observatory Laboratoire Arago (SOLA) station within the Mediterranean Sea (Galand et al., 2018), as well as a freshwater data set from the eutrophic Lake Mendota (MEN) (Linz et al., 2018).

| Data acquisition and bioinformatic processing
We downloaded and processed publicly available time-series metagenome data and the corresponding environmental data from two surface aquatic environments, including the Pacific (ALOHA) (Biller et al., 2018) and the Sargasso Sea (Biller et al., 2018). Three partly processed aquatic surface metagenome time-series and environmental data from Lake Mendota and the Northwest Mediterranean Sea and the Baltic Sea Proper (Hugerth et al., 2015) were provided from collaborators (Table S1).
Downstream analyses required large data sets and thus we included only metagenomes with a sequencing depth close to or exceeding 10 million reads. We analysed metagenomes from six to 10 sampling dates for each metagenome time-series, spanning over at least 6 months (Table S2). More details of the sampling procedure and the software used for bioinformatic processing of the metagenome data are summarized in Table S1. The code used in this study for data processing is available via https://www.proto cols.io/view/ allel e-varia bilit y-bkhqkt5w.
In short, all raw sequence data were quality trimmed and co-assemblies were created for each time-series with the exception of the Baltic Sea metagenomes, where the publicly available BARM assembly (Alneberg et al., 2018) was used. CDS (coding DNA sequence) regions were identified by gene calling and subsequently clustered by 100% amino acid sequence similarity using the cd-hit software (Li & Godzik, 2006). Functional annotation of the CDS regions was performed using the KEGG database (Kanehisa et al., 2007). We defined all gene variants coding for the same gene orthologue but differing in their amino acid sequence as alleles of this gene. We ignored DNA sequence differences causing synonymous replacements in the amino acid sequence. This is because we aimed in this study to explore the direct interaction between environmental dynamics and the assembly of alleles based on protein properties and therefore their amino acid but not their DNA sequences. We are aware of the possible inaccuracies introduced by the co-assembly of multiple time-series metagenome data sets: for instance, smallscale variations, such as single nucleotide polymorphisms (SNPs) causing nonsynonymous replacement that can discriminate between closely related alleles may be masked in the consensus sequences (Dick, 2018). A reduced resolution of our assembly-based approach for the identification of closely related alleles will probably result in a general underestimation of the true temporal allele variability estimated as described below. However, we believe that such inaccuracy will be similar among the considered gene categories and should therefore not bias our downstream analyses in a certain direction.
The quality-trimmed data were mapped on the respective co-assemblies using bowtie2 set to very-sensitive-local (Langmead & Salzberg, 2012) and summarized using the featurecounts software (Liao et al., 2014).
For downstream analyses, we focused on KEGG orthologues that were present in fewer than two-thirds of all prokaryotic genomes in the KEGG database (available at: https://www.kegg.jp/ kegg/downl oad/; release 78.1; downloaded: May 2016) containing data from 3,865 prokaryotic genomes), which we refer to as auxiliary genes. Our decision to exclude gene orthologues that are present in most of the prokaryotic genomes was motivated by the fact that a high proportion of these genes encode proteins involved in essential functions, such as replication, transcription or translation (Table S3).
The evolution of these genes often correlates with the evolution of neutral markers (Medini et al., 2008), and they therefore represent taxonomic rather than functional markers. We are aware that among the genes here defined as auxiliary genes, some are still obligatory for certain phylogenetic lineages and their sequence diversification can correlate with the phylogenetic evolution of members of this specific lineage.

| Gene copy number and seasonal metacommunity size
For each gene orthologue identified via the KEGG ontology, the average per-cell gene copy number among carrier genomes was determined based on the average gene copy number in prokaryote organisms of all entries coding for the respective gene orthologue (KEGG; release 78.1). In the remainder of this paper we refer to the average copy number of a gene across all carrier genomes in the KEGG database simply as "gene copy number." While the presence of different communities at each of the examined sites will result in site-specific gene copy numbers, this information cannot be determined without access to the full sequence information for each genome present at this site. However, metagenome data do not allow us to reconstruct full genome data from all community members and we therefore use the full genome data of all prokaryotic genomes stored in the KEGG database to obtain a value for the gene copy number of individual genes that was then used to approximate the true values in the communities from all sample sites. We thereby assume that the ranking from single-copy genes to high-copy-number genes will be similar across different ecosystems.
The gene-specific seasonal metacommunity size was assessed by dividing the total number of alleles for the respective gene orthologue detected throughout each metagenome time-series divided by the gene copy number of this gene orthologue. To match data from the downstream allele variability analyses, we estimated the metacommunity size only after subsampling allele count-data for the respective genes to 500 counts per individual metagenome in each time-series. The resulting parameter estimates the seasonal metacommunity richness of carrier organisms for each gene orthologue and represents a parameter for temporal gamma diversity. The term metacommunity has been defined as local communities that are connected by dispersal of multiple potentially interacting species (Wilson, 1992), while for this study we use the term seasonal metacommunity to describe several communities from the same location but across different seasons. However, spatial and temporal community dynamics are linked because a recolonization of populations being temporally extinct from one site can only occur via the dispersal of these populations remaining at spatial refuges. Therefore, we expect some overlap between the theories concerning local metacommunities and seasonal metacommunities as defined here, as for instance exemplified in the similar properties of species-area relationships and species-time relationships (White et al., 2006).
It has been recently argued that metagenome-delineated measures of the richness of functional units, such as the seasonal metacommunity parameter introduced here, are ecologically meaningful measures to address functional redundancy (Louca et al., 2018).

| Subcellular location
The subcellular location of all entries in the KEGG database (release 78.1) was determined based on the presence of signal peptides in the amino acid sequences via the psortb (version 3.0) software (Yu et al., 2010). We considered all sequences affiliating according to the KEGG taxonomy with the phyla Firmicutes or Actinobacteria as originating from gram-positive bacteria (Gontang et al., 2007) and the remaining bacterial sequences as originating from gram-negative bacteria. We applied the settings for gram-negative bacteria, gram-positive bacteria or archaea for the KEGG entries affiliating with either of these groups, respectively. To test Hypothesis 3, gene orthologues were defined as cytoplasmatic if >80% of all entries were identified as coding for cytoplasmatic proteins. Gene orthologues were defined as noncytoplasmatic if >80% of all entries were identified as coding for either extracellular, cell-wall-bound (grampositives, archaea) outer membrane-bound (gram-negatives) or periplasmatic proteins (gram-negatives). Gene-encoding proteins with a subcellular location predicted in the cytoplasmic membrane of gram negatives were ignored because their gene products can be in contact with both the cytoplasm and the outer environmental matrix.

| Temporal variability of gene-specific alleles
The temporal variability of the alleles for each individual auxiliary gene was estimated using the multivariate dispersion metric (Anderson et al., 2006), which determines the average distance of individual samples within each time-series to the group centroid. We used this method analogously to earlier applications, with the only difference that instead of compositional data for taxonomic units (e.g., operational taxonomic units [OTUs], Grubisic et al., 2017), the input data consisted of compositional data of the alleles representing the respective auxiliary gene. We applied the Bray-Curtis index for estimating pairwise dissimilarity between individual samples and the sample centroid after subsampling for each auxiliary gene to 501 counts per sample and thus only considering genes exceeding 500 counts in each individual sample. Earlier estimates of sequencing depths necessary to estimate beta-diversity found that Bray-Curtis distance-based beta-diversity patterns are particularly insensitive to low sequencing depths as long as all compared samples are normalized to the same sequencing depths (i.e., 500 counts per sample were sufficient to capture roughly 90% of the total beta-diversity of microbial communities) (Lundin et al., 2012).
We additionally estimated the temporal variability of 27 obligatory single-copy genes (Raes et al., 2007) in each of the five time-series, as described above for auxiliary genes. These were used as a proxy for the taxonomic variability in community composition.

| Statistical analyses
To compare the overall temporal variability of both functional and taxonomic units across all five time-series, mean values and the standard variation of temporal variability for all auxiliary and taxonomic single marker genes (Table S4) within each time-series was estimated (Table 1). In this case the values from the different timeseries were directly compared against each other. To improve compatibility across data sets, we subsampled all functionally annotated alleles from all individual samples to the minimal number of mapped and functionally annotated genes (1,740,778 in SRR2053279, Table   S2) prior to the second subsampling step to 501 reads per gene per sample for computing allele variability data.
We applied analyses of covariance (ANCOVAs) to test the dependence of the temporal variability of auxiliary genes on their seasonal metacommunity size, gene copy number and the subcellular location of these genes defined as detailed above. In this case, analyses were performed for each time-series separately and in order to keep the maximal predictive power, no subsampling was carried out at the level of the total metagenome data across the time-series data. All genes exceeding 500 reads in individual samples of the time-series data, annotated as either cytoplasmatic or noncytoplasmatic (Table   S3), were incorporated into the analyses. Residuals were inspected manually, and to improve their fit to a normal distribution the two continuous variables (seasonal metacommunity size, gene copy number) were log-transformed and subsequently mean-centred. Due to the highly unbalanced occurrence of cytoplasmatic versus noncytoplasmatic genes, significance was tested considering type III errors.
To avoid biases due to a singular subsampling event to a reduced data set with only 501 counts per gene per sample, we estimated the p-values for our statistics using a bootstrapping approach with 1,000 permutations. An R script with the code to estimate gene allele variability and perform the described statistical tests is available via https://github.com/sarab eier/allele.varia bility. Gene orthologues that exceeded 500 counts in the individual samples for all time-series, and that were therefore included into the analyses, were ranked in each time-series by their allele variability. The variance of the gene variability rank was estimated in order to identify genes with similar or very distinct variability across all five time-series.

| Overall environmental heterogeneity
An index for environmental heterogeneity for each of the time-series was estimated by summing the coefficient of variance for temperature, salinity and chlorophyll a concentration (Table S2)  html), respectively. Environmental data for SOLA and LMO were used as published earlier (Galand et al., 2018;Hugerth et al., 2015).

| RE SULTS
Mean values for auxiliary gene variability ranged from 0.40 in the HOT time-series to 0.52 in the SOLA time-series. Mean allele variability of phylogenetic marker genes used as a proxy for the variability in taxonomic community composition exhibited similar values, ranging from 0.40 to 0.53 (Table 1). The ranking of mean allele variability in the five time-series was nearly identical for auxiliary genes and phylogenetic marker genes (Table 1). An index of environmental heterogeneity based on the temporal variability of temperature, salinity and chlorophyll a concentration in the five time-series (Table 1) indicated a higher positive correlation to the mean allele variability of auxiliary compared to core genes (Pearson's r = .80 and r = .63, respectively). However, due to the low number of data points (n = 5), none of these correlations was significant, apart from high correlation coefficients (p = .10 and p = .26, respectively).
In all five time-series, we identified a gene orthologue encoding an outer membrane iron receptor protein (KEGG ID: K02014) as the temporally most variable auxiliary gene (Figure 1; Table S5). The allele variability of other genes was less consistent across the five data sets. For instance, the alleles of restrictios enzymes encoded by the genes hsdR and hasdM (K01153 and K03427) were highly variable in SOLA, LMO and MEN, but belonged to genes with low variability in the offshore sites BATS and HOT (Table S5). The temporal variability of auxiliary genes exhibited a strong and highly significant positive correlation to seasonal metacommunity size (p < .001, F > 125; Table 2; Figure 1), which supports Hypothesis 1. The correlation with gene copy number was weaker, but still highly significant and positive in all five time-series (p < .001, F > 61; Table 2; Figure 1). Accordingly, our analyses also support Hypothesis 2.
Our results demonstrated that in the time-series HOT, LMO and MEN, the variability of genes coding for cytoplasmatic enzymes was lower than that of noncytoplasmatic ones (p < .05, F > 5; Table 2; Figure 1). However, the impact of the subcellular location of the gene product on their temporal variability could not be verified at a pvalue cutoff p < .05 in the oligotrophic SOLA and BATS time-series, although in both cases a trend for higher allele variability of noncytoplasmatic genes was detected (Table 2; Figure 1). Hypothesis 3 was accordingly only partially supported by our analyses.
The interaction term between the parameters was not part of our hypothesis and we therefore focus on reporting the main effects of the tested variables. However, significant interaction terms among continuous variables can in extreme cases indicate that the direction of the main effect is only valid within a certain data range (Frost, 2019). Because we detected significant interaction terms in some cases (Table S6)

| D ISCUSS I ON
To our knowledge this is the first study that systematically compares multiple microbial metagenome time-series by assessing allele variability as a beta diversity measure to address the dynamics in the assembly of traits. Our data thus demonstrate that the assembly of functional units such as alleles follows similar patterns to that of tax- average gene copy number range covered, sequencing depth and data processing; Tables S1 and S2). This was of lesser importance as our primary focus was on testing our hypothesis by comparing variability patterns for genes only within each of the five time-series. Also, differences in the time range covered by the individual time-series did not seem to influence the observed patterns to any major extent. This is possibly because the largest temporal variability in microbial community composition in marine samples was shown to occur across intra-and interseasonal time frames, which were covered in all five time-series, while less variability was related to interannual timescales (Hatosy et al., 2013).
Among the variables that were tested for their impact on the temporal variability of functional units, seasonal metacommunity size clearly had the strongest effect (Table 2; Figure 1). It has been argued that a positive relationship between metacommunity size and variability in community composition is due to an increased importance of stochastic effects in the assembly of temporally consecutive communities (Ayarza & Erijman, 2011 et al., 1988). This can in turn increase the temporal variability of alleles representing this gene (Oliver et al., 2012).
We tested more specifically the impact of environmental heterogeneity on environmental variability of functional units by focusing on gene copy number and the subcellular location of gene products. As outlined in the Introduction, this is because we assumed that these variables are linked to the environmental heterogeneity to which a given gene responds in a selective process. Metacommunity size*** 1,041 <.001
F I G U R E 2 Schematic figure illustrating mechanisms that can lead to elevated allele variability of genes. Genomes 1-4 contain the phylogenetic marker gene P and two accessory genes X and Y. Genome 4 splits into the subgenomes 4A and 4B that share large parts of their genome but differ in gene copy number of gene X, for example due to the horizontal acquirement of the second gene copy of X in 4B. Temporal allele variability of accessory genes can reach high values if the population of carrier species for this specific gene is characterized by large temporal fluctuations in its composition or if genome rearrangement events lead to temporally alternating subspecies as exemplified in genome 4A/B In agreement with our assumptions, data from all time-series indicated that temporal variability of alleles increased with increasing average copy number of the respective gene ( Figure 1; Table 2).
There are, broadly speaking, two different molecular mechanisms that may lead to increased temporal variability of alleles from multicopy genes: first as compared to single-copy genes they may be exposed to higher frequencies of genomic rearrangements, such as gene duplication, gene loss or gene gain via HGT. In such scenarios, temporally alternating subspecies whose genomes differ regarding some horizontally acquired genes but not in the remaining genome would lead to a higher allele variability of the horizontally acquired genes compared to the vertically inherited genes ( Figure 2). The temporally alternating occurrence of close relatives indeed seems to be a common scenario in aquatic habitats (Cordero & Polz, 2014;Ward et al., 2017). Even though single-copy genes in a genome may also have been acquired from HGT, the appearance of multicopy genes must have been the consequence of a preceding HGT or gene duplication event. Furthermore, consecutive genomic rearrangements of genes that are already present in multiple copies within a genome, for example gene loss, are probably less deleterious than for single-copy genes. In agreement with these considerations, rearrangement events have indeed been reported to be more frequent for multicopy than for single-copy genes (Lerat et al., 2003) and this may be one molecular mechanism explaining the positive correlation of allele variability and gene copy number. A second possible mechanism that could underpin the observed positive relationships between gene copy number and temporal allele variability is that multicopy genes, independently from genome rearrangement events, may appear disproportionally often in taxa whose abundance typically fluctuates strongly over time. Multicopy genes are not equally distributed among different microbial lineages in marine environments, where taxa with oligotroph life strategies tend to feature genome streamlining (Lauro et al., 2009) and consequently a reduced occurrence of nonessential genes, such as multicopy genes as compared to copiotrophic community representatives. Indeed, in marine systems, populations adhering to the latter ecological strategy seem often to appear episodically with strongly fluctuating abundances (Giovannoni et al., 2014;Vergin et al., 2013).
As with multicopy genes, the literature proposes that cell surface proteins are disproportionally often exposed to gene rearrangement events (Nakamura et al., 2004). This has particularly been highlighted for members of the abundant SAR11 bacterial clade where the hypervariable HVR2 region mainly encodes genes that determine cell surface properties (Wilhelm et al., 2007).
Therefore, elevated genetic rearrangement events at the subspecies level of noncytoplasmatic genes, which allow species featuring multiple ecotypes to deal with different or fluctuating environmental conditions, probably contributed to the increased temporal variability of these genes detected in three of the five time-series investigated. On the other hand, we believe that a pronounced accumulation of noncytoplasmatic proteins specifically in episodically appearing taxa to be less likely as this was the case for multicopy genes: all microorganisms need proteins that are in contact with the extracellular matrix in order to interact with their environment and this should be largely independent of their life history strategy.
Remarkably, while the subcellular location of gene products significantly influenced the temporal variability of alleles in only one of the three time-series from oligotrophic marine sites (HOT), a significant correlation was seen for both the LMO and MEN time-series, both of which are classified as productive. High productivity is generally coupled with more pronounced algal blooms and consequently larger fluctuations in an array of environmental variables likely to influence microbial populations. This includes the availability of labile dissolved organic matter (Bertilsson & Jones, 2003) or pH, particularly in less buffered freshwater systems (Verduin, 1956).
It has furthermore been shown that heterotrophic bacterial community composition in aquatic habitats is strongly structured by the appearance of algal blooms (Eiler & Bertilsson, 2004;Kent et al., 2007).
Possibly, these environmental conditions are of particular relevance for the allele variability of proteins that are in contact with the outer milieu. Yet, in contrast to the BATS and SOLA, the oligotroph HOT time-series also featured significantly higher allele variability for noncytoplasmatic as compared to cytoplasmatic genes. Strikingly, the variability of temperature was very low in the BATS time-series (SD: 1.14; Table S2) compared to the HOT and SOLA time-series (SD: 3.15 and 3.66, respectively; Table S2). We speculate that in contrast to HOT, higher temperature variability in SOLA and BATS may have masked the effect of low chlorophyll variability (Biller et al., 2018) and all linked parameters in these three oligotrophic sites. Temperature variations should, in contrast to variation in other environmental parameters, not affect allele variability in contrasting subcellular locations because prokaryotes do not regulate their intracellular temperature. Consequently, temperature changes would affect inner and outer cellular proteins equally. The crucial role of temperature in structuring marine bacterioplankton has been highlighted in earlier metagenome-based studies (Raes et al., 2011;Sunagawa et al., 2015) and by noting that temperature is an important parameter driving the abundance of different SAR11 ecotypes (Brown et al., 2012).
The indicated stronger correlation of average auxiliary gene allele variability with that of the average allele variability of phylogenetic marker genes (Table 1) confirms a better resolution of functional compared to taxonomic information for matching the metagenome with the corresponding environmental data, as suggested recently (Caputi et al., 2019). Note, however, that the heterogeneity index is only a rough estimator of true environmental heterogeneity: on the one hand, chlorophyll a concentration of Lake Mendota was only available with several days of mismatch (Table S2). Furthermore, only temperature, salinity and chlorophyll a were considered in this index, while marine microbial communities are exposed to a plethora of other environmental variables. For instance, parameters such as vitamin availability that often arise from biotic activity and indicate biological interactions may be essential for community assembly (Herren & McMahon, 2017) and accordingly also for temporal allele variability. The indicated correlation between the heterogeneity index and average allele variability therefore suggests that in our study temperature, salinity and chlorophyll a concentrations may have influenced the variability of other parameters not considered that have an impact on allele variability.
Interestingly, within all time-series, a gene encoding multicopy outer membrane iron receptor proteins (K02014, including FhuA) was detected as the gene with highest temporal allele variability ( Figure 1; Table S5). Iron is a cofactor in many central reactions of organisms, such as respiration and photosynthesis, and is therefore essential for the viability of all cells (Holm et al., 1996). Proteins summarized as the KEGG orthologue K02014 interact with siderophore-bound iron, and the variability of siderophore types excreted by different siderophore producers may partly explain the high allele variability of K02014. However, besides their function as iron transporters, these proteins also serve as targets for phage infection. This was highlighted by the "Ferrojan Horse Hypothesis," according to which phages incorporate iron atoms into their tail to be recognized and enter host cells via K02014 transporters (Bonnain et al., 2016). A recent metagenome-enabled study demonstrated that iron-binding motives were present in 87% of unigenes encoding marine viral tail proteins. Furthermore, the corresponding viral contigs were distributed ubiquitously and with high abundance in marine sites worldwide sampled during the Tara Ocean Expedition (Caputi et al., 2019).
These findings indicate that "ferrojan horse" phages may play an important role in the ecology of marine systems. It has been proposed previously that the interaction of prokaryotes with phages is a major mechanism for maintaining prokaryote diversity, particularly due to the variability of genes that interact with phage infections, such as K02014 (Cordero & Polz, 2014;Rodriguez-Valera et al., 2009 Overall, our findings indicate auxiliary multicopy genes encoding proteins with contact to the outer milieu as being highly sensitive marker genes to monitor and track environmental change in aquatic environments. Modelling of ecosystem characteristics using metagenome input data, for instance via machine learning techniques, provides promising approaches, but can suffer from overfitting due to the extensive number of potential predictor variables, particularly if only a few metagenomes are available (Angermueller et al., 2016).
Focusing on a subset of relevant genes, such as those identified here empirically as highly sensitive marker genes, may improve the predictive power of such modelling approaches. A remarkable consistency of temporal patterns, such as the ranking of the tested variables with regard to their impact on allele variability, or the detection of an iron receptor and phage target protein as a maximally variable gene across all five time-series, points to relevance of trait-based analyses in predicting the assembly of complex microbial communities in a broad range of environments. We further want to corroborate previous suggestions (Boon et al., 2014;Escalas et al., 2019;Raes et al., 2011) that incorporating allele alpha and gamma diversity (e.g., allele metacommunity size) or also beta diversity patterns (e.g., allele variability) additional to gene abundance patterns into metagenome-based analyses might be a promising procedure to improve the fit of metagenome information to environmental data. Diversity measures seem to be particularly relevant for addressing the link between an ecosystem's heterogeneity and functional redundancy as well as assembly dynamics of inherent microbial communities.

ACK N OWLED G M ENTS
We

DATA AVA I L A B I L I T Y S TAT E M E N T
The data that support the findings of this study are openly avail-