Exploring environmental intra‐species diversity through non‐redundant pangenome assemblies

At the genome level, microorganisms are highly adaptable both in terms of allele and gene composition. Such heritable traits emerge in response to different environmental niches and can have a profound influence on microbial community dynamics. As a consequence, any individual genome or population will contain merely a fraction of the total genetic diversity of any operationally defined “species”, whose ecological potential can thus be only fully understood by studying all of their genomes and the genes therein. This concept, known as the pangenome, is valuable for studying microbial ecology and evolution, as it partitions genomes into core (present in all the genomes from a species, and responsible for housekeeping and species‐level niche adaptation among others) and accessory regions (present only in some, and responsible for intra‐species differentiation). Here we present SuperPang, an algorithm producing pangenome assemblies from a set of input genomes of varying quality, including metagenome‐assembled genomes (MAGs). SuperPang runs in linear time and its results are complete, non‐redundant, preserve gene ordering and contain both coding and non‐coding regions. Our approach provides a modular view of the pangenome, identifying operons and genomic islands, and allowing to track their prevalence in different populations. We illustrate this by analysing intra‐species diversity in Polynucleobacter, a bacterial genus ubiquitous in freshwater ecosystems, characterized by their streamlined genomes and their ecological versatility. We show how SuperPang facilitates the simultaneous analysis of allelic and gene content variation under different environmental pressures, allowing us to study the drivers of microbial diversification at unprecedented resolution.

| 1725 PUENTE-SÁNCHEZ et al. interactions that underpin their dynamics and change over time, space, and along environmental gradients (Galand et al., 2018;Louca et al., 2018). However, these advances also showed that variability was prevalent at all taxonomic scales, including striking differences in functional properties and environmental partitioning between microorganisms belonging to the same species (Larkin & Martiny, 2017).
Microbial species are comprised of multiple strains that share a core genome and differ in the accessory genome. The term "pangenome" has been coined, referring to the combined pool of genes found in known strains of a species, and this pangenome can be partitioned into a core and accessory component (Tettelin et al., 2005). Variations in the accessory genome are assumed to be responsible for niche differentiation and allow the discrimination of ecologically meaningful subpopulations (Cohan, 2001;Coleman & Chisholm, 2007;Koeppel et al., 2013). This variation, often referred to as "intra-species diversity" or "microdiversity", has profound consequences for the functioning of microbial communities (Fuhrman & Campbell, 1998;García-García et al., 2019). However, its extent and specific roles remain poorly characterized due to the difficulty in obtaining high-quality genomes from relevant environmental populations. The description of microbial species has traditionally been dependant on culturability, which greatly limited our view of microbial biodiversity (Sanford et al., 2021).
However, recent methodological advances in sequencing capacity and bioinformatic tools have enabled reconstruction of near-complete microbial genomes from shotgun metagenomic data (metagenomeassembled genomes [MAGs]), and also group them based on their average nucleotide identity (ANI) into metagenomic operational taxonomic OTUs (mOTUs) which correspond to species-level clusters from a purely genomic viewpoint (Jain et al., 2018;Richter & Rosselló-Móra, 2009;Sunagawa et al., 2015).
This wealth of culture-independent genomic data has allowed researchers to move beyond the use of single reference genomes (Coleman & Korem, 2021), spurring the development of tools aiming to capture the full complexity of microbial pangenomes such as Roary (Page et al., 2015), PPanGGOLiN (Gautreau et al., 2020), PanACoTA (Perrin & Rocha, 2021), or mOTUpan (Buck et al., 2022). These tools work by generating gene clusters from the input genomes, and then rely on different statistical approaches to classify these as part of the core or accessory genome. A disadvantage is that this approach relies on external prediction software such as Prodigal (Hyatt et al., 2010) for identifying genes in the input genomes. This is a straightforward task for genes encoding proteins or ribosomal/transfer RNAs, but becomes increasingly difficult for non-coding regions which might nonetheless mediate essential functions (Rogozin et al., 2002). Another shortcoming is that information on synteny is in principle lost when breaking the genome into separate features, although methods such as Roary and PpanGGOLin do keep track of gene neighbourhood information when classifying clusters into the core or accessory gene sets.
An alternative to this is to directly build assembly graphs from sequence data in order to represent the pangenome (Brown et al., 2020;Colquhoun et al., 2021;Quince et al., 2021). Notably, the STRONG pipeline (Quince et al., 2021) aims to build a coassembly graph from different samples using metaSPAdes (Nurk et al., 2017), after which it performs binning and haplotype resolution. This method is particularly well suited for longitudinal or time-series metagenomic studies, but does not allow inclusion of genomes obtained from other sources, such as isolates or single-cell amplified genomes (SAGs), or even MAGs obtained from other studies. Furthermore, the high memory usage of metaSPAdes makes it impractical for assembling very large datasets (Van der Walt et al., 2017), constraining the number of samples that can be analysed with a coassembly-based approach.

| Software description
Here we present the SuperPang software to generate non-redundant pangenome assemblies from multiple input genomes from the same species/mOTU. The input genomes can come from isolates, SAGs or MAGs indistinctively, and genomes from different sources and different qualities can be combined in the same analysis. Figure 1 summarizes the functioning and results of SuperPang, and how they compare to those obtained with raw pangenomes or single reference genomes.
SuperPang works by first homogenizing the input sequences so that homologous regions come to have identical sequences, and then build a pangenome assembly graph. In this graph, homologous regions are dereplicated and linked together following their synteny in the input genomes, providing a transparent and useful representation of the pangenome architecture. The regions of the pangenome that were always found together in the input genomes will be represented as non-branching paths (NBPs), with branching points indicating either reordering events or interruption of synteny due to the presence of accessory genes. NBPs thus naturally identify modules of genes with preserved synteny that are always inherited or transferred together, potentially due to functional relatedness (e.g. gene operons or genomic islands [GIs]). Paralogous sequences, which are generally highly diverged in prokaryotic genomes (Copley, 2020;Pushker et al., 2004) are expected to not be collapsed and instead be represented separately in the right genomic context. The mOTUpan algorithm (Buck et al., 2022) is then run to classify each NBP into the core and accessory genomes based on their occurrence in the input sequences. The NBPs are then exported as nucleotide sequences, which facilitates a modular analysis of the core and accessory genomes of the species. Furthermore, the graph is traversed in order to combine the different NBPs into longer contigs that follow the consensus gene ordering in the pangenome.
The result of running SuperPang is a non-redundant reference assembly for the input species which, unlike the representative genomes obtained by dereplication methods such as drep (Olm et al., 2017), contains both the core and accessory genomes. This assembly can be annotated with tools such as Prodigal (Hyatt et al., 2010), Prokka (Parks, 2014) or SqueezeMeta  in order to identify protein-coding sequences, RNA-genes or other non-coding elements of interest, but it will also preserve the information on the intergenic regions, potentially allowing for the discovery of non-coding, regulatory regions involved in intra-species diversification. This assembly will F I G U R E 1 Strategies for analysing the genetic diversity of microbial species. Upper left: the use of all the available genomes from the species of interest allows to analyse the totality of the accessory genome, but many regions will be redundant, resulting in ambiguous read mapping. Lower left: choosing a single reference genome for mapping will result in non-ambiguous mappings, but reads originating from the missing accessory genes will not be recruited. Right: SuperPang will produce a complete and non-redundant pangenome assembly partitioned into core and accessory components that preserves gene ordering and can be used for different applications.
unambiguously recruit virtually all the shotgun meta'omic reads originating from the input species, enabling variant calling not only in the core, but also within the accessory genome. Read mapping can also be easily used to track the prevalence of the different accessory genes across populations. The SuperPang assembler is thus a versatile tool for high-resolution studies of intra-species diversity. Below, we provide details on its implementation and performance, as well as several examples highlighting its potential applications.

| Software implementation and availability
SuperPang is implemented in the Python3 and Cython3 programming languages and released under the GNU General Public License v3.0.
The source code is available at https://github.com/fpusa n/Super Pang.

| Computational resources
All analyses were performed in an Ubuntu 20 workstation with an AMD Ryzen 95,900X 12-Core Processor and 128 Gb of RAM. When applicable, processes were run using 24 CPU threads. Data used for benchmarking were stored in a 1 Tb Samsung SSD 980 PRO 1 TB drive. Preliminary testing was also performed in UPPMAX (Uppsala Multidisciplinary Center for Advanced Computational Science) computing cluster.

| Input data
SuperPang accepts a set of FASTA files, each containing the contigs from a different genome. Genomes are assumed to belong to the same phylogenetic cluster (e.g. same species, or same 95% ANI mOTU). Genomes can be complete genomes, draft genomes, MAGs or SAGs. Optionally, the user can also supply the standard output obtained by running CheckM (Parks et al., 2015) over the input genomes. The completeness of each input genome will then be used to differentiate between the core and accessory contigs in the final assembly (see below).

| Homogenization of input sequences
Prior to assembly, we identify homologous regions in the different input sequences and homogenize them so that they will later assemble into the same contig. The objective is to generate a pangenome assembly that contains the entirety of the core and accessory genomes, but in which polymorphisms are collapsed into a single consensus sequence, thus avoiding duplications in the assembled core genome. Homogenization is performed by the following iterative algorithm: 1. While the number of iterations is less than a certain threshold r and at least one sequence was homogenized sequence in the previous iteration.
1.1. Sort the input sequences by decreasing length.
1.2. For each target sequence, visited in decreasing length order, that has not been corrected by a longer target sequence during this iteration.
1.2.1. For each query sequence smaller than the target sequence, that has not been corrected by a longer target sequence during this iteration, and has not been corrected by this target sequence in previous iterations.
1.2.1.1. Align target and query using minimap2 (Li, 2018) with 1.3. Write the sequences (including any changes resulting from homogenization) to a temporary output file to be used as the input for the next iteration.
2. Use the output of the last iteration as the final output.
The different thresholds can be controlled by the user, with defaults of i = 0.95, m = 100, g = 100, r = 20. Increasing m and g will result in a more aggressive homogenization: longer contiguous mismatch and indel stretches will be homogenized, reducing the duplication levels in the core genome after pangenome assembly, but potentially also removing some accessory genes. Sequence homogenization is performed by default when running SuperPang, but can also be run independently by calling the homogenize.py standalone script.

| Pangenome assembly
Input sequences are stored into a De-Bruijn graph (DBG) of default kmer size k = 301 using Graph-Tool (Peixoto, 2014). Non-branching paths (NBP) in the DBG are then used to build a sequence graph (SG), in which each node represents a single NBP and vertices join NBPs that have been observed to be contiguous in the same input sequence. The SG is then split into connected components, and pairs are identified such that the sequences from both members are each-other's reverse complements. For each pair, the component with the highest numbers of NBPs in the same orientation as the input sequences is deemed the forward component and reported as a scaffold in the assembly. NBP coverage is defined as the average prevalence of its constituent kmers in the input genomes, we note that this is unrelated to the coverage of the NBP in any particular sample. NBPs sharing the first and last kmers (i.e. corresponding to a bubble in the DBG) will be aligned pairwise using the mappy module from the minimap2 suite (Li, 2018). In case the homology between them is higher than a certain threshold b (default 0.95) only the longest NBP will be kept in the assembly.

| Core genome identification
NBPs are classified with the Bayesian classifier mOTUpan (Buck et al., 2022) in order to determine their likelihood to belong to the core or the accessory genome. Sequences sharing a fraction a of its kmers (of length k, as used to build the DBG; default a = .5) with an input genome are deemed to be present in that genome. For each NBP sequence, mOTUpan then classifies it as core/accessory/singleton by taking into account its prevalence in the input genomes, and the estimated completeness of those genomes (Buck et al., 2022).
If completeness estimates are not provided as an input, SuperPang assumes a 50% starting completeness and lets mOTUpan calculate posterior estimates.

| Contig generation
NBPs are combined by traversing the SG with the following greedy algorithm. 2. Remove the visited nodes from the SG and recalculate its origins.
The assembler outputs a FASTA file containing the contigs, and a FASTG file (following the SPAdes/MEGAHIT specification) containing assembly graph and suitable for visualization in Bandage (Wick et al., 2015). A separate info file keeps track of which regions of each contig were deemed to be core, accessory or singletons. Individual NBPs are also outputted into three different files containing all the NBPs, the core NBPs, and the accessory NBPs respectively.
SuperPang was run using 24 CPU threads on an increasing number of genomes, in multiples of five, until finally running it in the totality of the 113 genomes. For each step, SuperPang was run in 10 replicates by drawing the requested number of genomes at random without replacement. For each SuperPang run, we tracked the total execution time, maximum memory usage, the total size in base pairs of the predicted core and accessory genomes, and the completeness and contamination of the predicted core genome as measured by CheckM (Parks et al., 2015) using the marker genes for the genus Polynucleobacter.

| Read recruitment
One million synthetic metagenomic reads of 125 base pairs (bp) were simulated from the unassembled 113 input genomes using InSilicoSeq (Gourlé et al., 2019). The -mode perfect and -coverage uniform parameters were used to generate reads with no errors that covered the input genomes uniformly. Reads were mapped back to the SuperPang assembly obtained from the same 113 input genomes using blastn (Altschul et al., 1990) with a threshold of 95% identity and 90% query coverage. We then counted the proportion of reads that mapped at least once to the SuperPang assembly, and also the proportion of reads that mapped in two or more positions.

| Core and accessory dynamics of environmental species
We ran SuperPang (v0.9.2) on genomes from a Polynucleobacter mOTU that was previously published as part of the StratFreshDB (Buck et al., 2021). In total, 44 MAGs with completeness above 40% and contamination below 3% were selected ( We then used SqueezeMeta  to taxonomically and functionally annotate the NBPs present in We expect execution time and memory usage to also depend on genome size and overall pangenome complexity, meaning that for many species they will be higher than the ones reported here for P. paneuropeus, which has a relatively streamlined genome. We also note that, while execution time was linear with the number of input genomes in our testing, the worst case scenario for input homogenization and other steps of SuperPang is quadratic, which might also become an issue for very complex datasets.
The size of the predicted core genome had a median of 1.49 Mbp for five input genomes, after which it decreased slightly and stabilized around 1.48 Mbp (Figure 2c). Core completeness remained stable around 88% with the number of input genomes (Figure 2d), while core contamination, which acts as a proxy for the core duplication level, remained around 1% (Figure 2e). This verifies earlier findings that the mOTUpan algorithm yields an accurate prediction of the core genome, even when the number of input genomes is low (Buck et al., 2022). In contrast, the size of the predicted accessory genome kept increasing with the num- Finally, even although SuperPang uses the mOTUpan algorithm for core and accessory genome estimation, it applies it to nonbranching assembly paths rather than gene clusters. We therefore obtained gene clusters from the SuperPang core NBPs, and compared them with the gene clusters predicted as core by mOTUpan alone (Buck et al., 2022) and PpanGGOLiN (Gautreau et al., 2020).
Out of 1537 gene clusters predicted as core by at least one method, 1395 (91%) were predicted as core by the three methods ( Figure S1).
This shows that the core genome estimation from SuperPang is in good agreement with that of preceding tools, and that mOTUpan produces comparable results when the input are NBPs instead of gene clusters. We thus refer to Buck et al. (2022) for a more extensive validation of its core-genome estimation capabilities.

| Test case 1: Identification of genomic islands
We first reconstructed the pangenome of Polynucleobacter asymbioticus (Hahn et al., 2016), which analogous to P. paneuropaeus is a widespread planktonic freshwater bacterium with a dynamic accessory genome hosting a diverse set of mobile GIs (Hoetzinger et al., 2017). Running SuperPang on genomes from nine isolates of the species yielded a core genome of 1.75 Mbp and 216 accessory NBPs longer than 1000 bp (1.44 Mbp in total). The majority of them (214, 1.44 Mbp) assembled into the main scaffold together with the core NBPs, which makes them good candidates for representing GIs.
We then assessed whether the accessory part of the SuperPang assembly could recover GIs that were defined in an earlier and more laborious approach as consecutive sequences of auxiliary genes longer than 10 kbp. Indeed, all 28 GI variants identified in the previous study were recovered, and >50% of the accessory genome of each strain was covered by NBPs longer than 10 kbp (compare

F I G U R E 3
Illustrating genomic islands in reference genomes using the SuperPang assembly. (a) Accessory NBPs longer than 10 kbp obtained from SuperPang. The NBPs are coloured according to Figure 3 in Hoetzinger et al. (2017) after alignment against the P. asymbioticus genomes. They are numbered in descending order of sequence length and arranged according to the order they appear in the reference genomes. (b) Core NBPs were ordered against the genome of strain QLW-P1DMWA-1 and concatenated. The thereby obtained core genome is represented in the middle (in green), and blastn hits to two reference genomes are depicted in grayscale according to sequence identity. GIs appear as alignment gaps, i.e. sequences present in a reference genome that do not align to the core. Red bars indicate tRNAs of the reference genomes. Blastn alignments to the auxiliary NBPs as shown in (a) are displayed above and below QLW-P1DMWA-1 and MWH-RechtKol4, respectively. NBPs aligned to GIs for which a specific function has been inferred previously in Hoetzinger et al. (2017) are named (CSC, cell surface composition; NIT, assimilatory nitrate reduction; HMR, heavy metal resistance; LIP, lipid metabolism; COD, carbon monoxide dehydrogenation; GGR, giant gene region; SUL, sulphate transport; ACD, aromatic compound degradation). (c) Example of a functional gene cassette split into separate NBPs, caused by a 101 bp long deletion in a reference genome. Genes associated to the ACD cassette are highlighted by grey fill colour. The nucleotide sequence of the NBPs in the split region is shown together with an alignment of the respective sequence in three reference genomes. (d) Accessory contig obtained from the SuperPang assembly containing the complete ACD cassette. NBPs contained in the contig are given above the gene representation. Another type are so called additive GIs, which are associated with mobile genetic elements and can harbour a variable number of gene cassettes that are typically flanked by integrases or transposases and/or tRNAs that may serve as target sites for integrases (López-Pérez et al., 2013). Six gene cassettes located within additive GIs were associated to metabolic functions in the previous study of P. asymbioticus (Hoetzinger et al., 2017), namely assimilatory nitrate reduction (NIT), lipid metabolism (LIP), carbon monoxide dehydrogenation (COD), sulphate transport (SUL) and heavy metal resistance (HMR). While we recovered these gene cassettes from the accessory NBPs, some of them were split into multiple NBPs due to sequence divergence among the reference genomes (Figure 3c). This drawback can be overcome by using the consensus contigs present in the assembly.fasta file, which combine NBPs according to their synteny in the sequence graph. By splitting this assembly at the core-accessory boundaries, accessory contigs were obtained that contained complete functional gene cassettes (Figure 3d, Figure S2). These results confirm that partitioning the auxiliary genome into contiguous sequences rather than separated genes, as done by conventional tools for pangenome analysis (Gautreau et al., 2020;Page et al., 2015), carries biological meaning and is favourable considering the emergence principle.

| Test case 2: Extending pangenomes with environmental sequences in known microbial species
We sought to generate an extended pangenome assembly of P. asymbioticus that included novel diversity from environmental sequences.
For this, we screened StratFreshDB (Buck et al., 2021), a dataset shotgun reads and assemblies from more than 250 environmental freshwater samples, and retrieved the MAGs that shared more than 95% ANI to any of our P. asymbioticus genomes. This search recovered five MAGs with less than 5% contamination and variable completeness levels (from 87% to 9%; Table S1). We combined these MAGs with the isolate genomes into a new dataset, and ran SuperPang to generate a new assembly that, using the isolate genomes as a backbone, could also incorporate new information that was only present in the environmental MAGs. This allowed us to identify 40 new accessory NBPs (296 kb in total) longer than 1000 bps that assembled into the main scaffold. Among them, we found what appeared to be a prophage that was missing in the isolate genomes, but present in the MAGs (Figure 4; Table S3). The graphical nature of the SuperPang output allowed us to easily determine the genomic context where the phage had integrated ( Figure 4c). Finally, five accessory NBPs (13 kb in total) were found only in the two low-quality MAGs (9.97% and 9.34% completeness respectively), but nonetheless assembled into the main scaffold. This shows the potential of SuperPang to recover information also from incomplete assemblies coming from environmental metagenomes, provided that other more complete genomes are also available.

| Test case 3: Spatiotemporal dynamics of the core and accessory genomes in uncultivated microorganisms
We ran SuperPang on 44 MAGs from a currently undescribed Polynucleobacter species, and used the results to analyse its population structure over a set of 44 samples (seven time points, variable depths) from Lake Loclat (Switzerland). We considered two sources of intra-species diversity: 1) point mutations leading to allelic variation and 2) the gain/loss of accessory genes, which leads to differences in gene content between populations. This allowed us to study the intra-species dynamics of the core and accessory genomes in both time and space.
We first studied population structure by performing an ordi-  of Polynucleobacter populations based on their accessory gene content, and visualized it using PCoA (Figure 5c).
Samples from the first time point clustered separately from the rest (Figure 5c, black points), as previously observed for the core and accessory FST. Samples from December, January and April formed very tight clusters (Figure 5c, blue, turquoise and green points), as opposed to the rest of the time points, where samples were more scattered. Interestingly, the former coincided with the period when the water column of Lake Loclat was subject to wind-driven homogenization ( Figure S1). The existence of such tighter clusters suggest that, when the lake is mixed, the species is strongly dominated by a single population. Similarly, the higher dissimilarities in accessory gene content when the lake is stratified may imply the existence of different subpopulations being successful at different depths.
We also assessed whether differences in allelic frequencies (FST) could be predictive of the accessory gene content. Our results suggest that this is the case, showing a high agreement between core FST, accessory FST and accessory gene content dissimilarity ( Figure 5d-f). This correlation was however noisy, suggesting that accessory genes might not be always bound to certain alleles in the core genome. Finally, we investigated how the prevalence of individual NBPs in the population changed with time or in response to environmental variables ( Figure 6). Some of the NBPs presented a sharp transition in copy number after the first sample (Figure 6a-c), similar to the one observed for allelic frequencies in the core genome ( Figure 5a). We could however also find examples of NBPs whose prevalence was associated with temperature, dissolved sulphate, or dissolved nitrate (Figure 6d-f Table S4.
multiple populations of the same species and as such take us one step further to describe and study the finer details of naturally complex microbial communities.

AUTH O R CO NTR I B UTI O N S
FP-S implemented the software. FP-S and MH designed research and wrote the draft manuscript. FP-S, MH and MB performed research. All authors contributed to the study conception, discussed the results and contributed to the final manuscript.

DATA AVA I L A B I L I T Y S TAT E M E N T
Genomic data is available in the GenBank database, accession numbers are available in Table S1. Metagenomic data and associated metadata are available in the SciLifeLab Data Repository (https:// figsh are.scili felab.se/artic les/datas et/Strat Fresh DB_v1_0/13005 311/2).

B E N E FIT S H A R I N G S TATE M E NT
There are no benefits to report.