Comprehensive pan‐genome analysis of Lactiplantibacillus plantarum complete genomes

Abstract Aims The aim of this work was to refine the taxonomy and the functional characterization of publicly available Lactiplantibacillus plantarum complete genomes through a pan‐genome analysis. Particular attention was paid in depicting the probiotic potential of each strain. Methods and results Complete genome sequence of 127 L. plantarum strains, without detected anomalies, was downloaded from NCBI. Roary analysis of L. plantarum pan‐genome identified 1436 core, 414 soft core, 1858 shell and 13,203 cloud genes, highlighting the ‘open’ nature of L. plantarum pan‐genome. Identification and characterization of plasmid content, mobile genetic elements, adaptative immune system and probiotic marker genes (PMGs) revealed unique features across all the L. plantarum strains included in the present study. Considering our updated list of PMGs, we determined that approximatively 70% of the PMGs belongs to the core/soft‐core genome. Conclusions The comparative genomic analysis conducted in this study provide new insights into the genomic content and variability of L. plantarum. Significance and Impact of the Study This study provides a comprehensive pan‐genome analysis of L. plantarum, including the largest number (N = 127) of complete L. plantarum genomes retrieved from publicly available repositories. Our effort aimed to determine a solid reference panel for the future characterization of newly sequenced L. plantarum strains useful as probiotic supplements.


INTRODUCTION
Next-generation sequencing (NGS) has extended our ability to understand complex biological phenomena, influencing drastically the experimental settings used up to the beginning of the second decade of the present century. Indeed, NGS allows the study of complex systems through the acquisition of an extensive amount of high-quality data in relatively short time and low cost. At the same time, the implementation of solid and easy-to-use data retrieval systems from public databases is allowing the analysis of large data sets at almost zero cost. In this context, theomics field, from genomics to proteomics and from transcriptomics to metagenomics, continues to expand unceasingly, improving both the pre-existing analytical pipelines and our ability to interpret results deriving from very complex matrices. Nowadays, it is possible to contextualize the results coming from big data analyses with relative ease, thus providing findings with high translational impact.
In the present work, we aim to refine the taxonomy and the functional characterization of Lactiplantibacillus plantarum (L. plantarum), formerly known as Lactobacillus plantarum, a versatile Gram-positive lactic acid bacterium, originally discovered in saliva, belonging to the large family of Lactobacillacae and present in a large range of environmental niches (Siezen et al., 2010). Notably, L. plantarum, one of the largest genomes known among the lactic acid bacteria, is able to survive gastric transit, thus easily colonizing the gut of human and other mammals (de Vries et al., 2006). Because of these properties, L. plantarum is considered one of the most used bacterial strains in food industry as probiotic and/ or microbial starter. The utilization of L. plantarum strains, characterized by their long history in food fermentation, is an emerging field in the designing of value-added foods (Behera et al., 2018). Indeed, L. plantarum strains have been used to produce new functional (traditional/novel) foods and beverages with improved nutritional and technological features (Behera et al., 2018). Lactiplantibacillus plantarum strains were identified from many traditional foods and characterized for their systematics and molecular taxonomy, enzyme systems (α-amylase, esterase, lipase, α-glucosidase, β-glucosidase, enolase, phosphoketolase, lactase dehydrogenase, etc.), and bioactive compounds (bacteriocin, dipeptides, and other preservative compounds) (Behera et al., 2018). Moreover, recent studies on microbiome composition, both in humans and in animal models, showed that L. plantarum strains possess clinically beneficial properties, to ameliorate dysbiosis states occurring in several medically relevant conditions, such as obesity (Soundharrajan et al., 2020) or cognitive dysfunction in major depression (Rudzki et al., 2019). A recent work from our group has also demonstrated that L. plantarum is able to prevent colonization of the urogenital tract by relevant pathogens such as Candida strains (Coman et al., 2015).
In this context, given its high translational potential in food industry and in clinical settings as well, a comprehensive analysis of deposited L. plantarum strain genomes, both at phylogenetical and functional level, by means of pangenome analysis, may provide useful insights into the different properties of L. plantarum strains. We expect this will also allow for a better selection of L. plantarum strains to be used in industrial settings and for an improved understanding of their effects on human health tout court. Moreover, a comprehensive pan-genome analysis of L. plantarum complete genomes may be serving as a reference, to help characterizing and identifying the beneficial properties of new isolated strains, potentially introducible in probiotic-supplementation formulas or in the production of functional foods.
Notably, two recent studies have already performed a pan-genome analysis of L. plantarum, considering the genome of 108 and 49 strains, respectively (Choi et al., 2018;Evanovich et al., 2019). In addition, another recent study reported the comparative pan-genome analysis of five different Lactobacillus species (i.e. L. reuteri, L. delbrueckii, L. plantarum, L. rhamnosus and L. helveticus), including 124 genomes of L. plantarum (Inglin et al., 2018). However, the main limitation of those studies lies in the fact that most of the genomes included in their analyses were not complete, providing the analysed genomes just at their 'draft' stage. The use of poorly assembled genomes, such as the ones provided in their draft stages, can intrinsically lead to analytical biases, therefore to incorrect taxonomic and/or functional characterization of the different strains.
Herein, we provide a comprehensive pan-genome analysis of complete L. plantarum genomes (N = 130), comparing the most updated genomic information with the previous findings, that mostly leveraged publicly available L. plantarum draft genomes, and QCing them according to the most updated pan-genome analysis pipelines (Wu et al., 2021). This will greatly expand our knowledge of L. plantarum biology, while providing, at the same time, a direct validation of the previous published findings.

Lactiplantibacillus plantarum complete genome sequence retrieval
The complete genome sequence of L. plantarum strains was downloaded from the National Center for Biotechnology Information (NCBI), under the 'Assembly' section, querying for 'Lactiplantibacillus plantarum'. Of the 541 available assemblies (July 2020), 130 were reported to have a complete assembly, thus with guaranteed full genome representation, which were used in the present work. Details regarding the identification, isolation source and sequencing of the samples are described in Table S1. Among these 130 genomes, we removed three L. plantarum strains (CNEI-KCA5, KLDS1.0391 and SN13T) that were missing the RefSeq assembly because of detected anomalies, as reported by NCBI (e.g. missing tRNA genes, many frameshifted proteins). Thus, we finally obtained 127 strains of L. plantarum for subsequent analyses.

Phylogenetic and average nucleotide identity analysis
According to Wu et al., (2021), the inclusion of confounding strains may introduce important biases that will greatly influence the interpretation of the pan-genome analyses. Thus, as suggested by Wu et al., (2021), we first determined the phylogenetic relationship among the 127 L. plantarum strains using OrthoFinder v2.4.0 (Emms & Kelly, 2019) with default parameters, using the protein sequences obtained from Prokka annotation. Then, we performed average nucleotide identity (ANI) analysis using FastANI v1.31 (Jain et al., 2018) with default parameters, using the nucleotide sequences directly retrieved from NCBI.

Identification of clinically relevant genomic elements and functional clustering
The information regarding the presence of plasmids was retrieved from NCBI Assembly page for each strain. To further explore the landscape of non-chromosomal genome sequences, we also queried PlasmidFinder v.2.0.1 for the presence of previously annotated replicons, considering the 'Enterobacteriaceae +Gram-positive' database, setting an 80% threshold for minimum percentage of identity and a 60% threshold for minimum coverage (Carattoli et al., 2014). PHAge Search Tool Enhanced Release (PHASTER) (Arndt et al., 2016) was used to screen for prophage specifying DNA regions within the genome of all strains. The bacterial genome sequence in FASTA format was used as input to detect the genes responsible for bacteriocin production using BAGEL4 software (van Heel et al., 2018). Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) and their associated (Cas) protein were data mined using CRISPRCasFinder (Couvin et al., 2018).
Using the Comprehensive Antibiotic Resistance Database 2020 (Alcock et al., 2020), we further determined the presence of acquired antibiotic resistance genes, prophages, bacteriocins and plasmids in the L. plantarum strains analysed.

Statistical analysis
Fisher's exact test was applied to contingency tables using R, with a statistical significance threshold set at p < 0.05 (two sides).

Main genomic features of L. plantarum pangenome
Compared with previous studies reporting pan-genome analysis of L. plantarum strains (Choi et al., 2018;Evanovich et al., 2019;Inglin et al., 2018), our study includes 107, 80 and 110 additional strains, respectively, for which the complete genome is now publicly available (Table S1). The analysis of 127 L. plantarum genomes by Orthofinder ( Figure 1) and FastANI ( Figure 2) did not show the presence of misassigned strains to the species. Thus, the full data set was considered in the subsequential analyses.
Nonetheless, both the analytic tools, when providing the relative phylogenetic trees, clearly displayed the presence of four (SRCM100438, SRCM100434, SRCM100440 and SRCM100442 by Orthofinder, Figure 1) and three strains (SRCM101187, ATG-K2 and DSM 16365 by FastANI, Figure 2), respectively, that clustered separately from the rest of whole data set. The genome of the four phylogenetically distant strains identified by Orthofinder were all provided by the Microbial Institute for Fermentation Industry located in South Korea and apparently did not show any particular genomic feature to be considered outliers or mis-assigned strains to the L. plantarum species. Conversely, the three phylogenetically distant strains detected by FastANI represent the ones with the highest GC content among all the L. plantarum strains analysed.
The average full genome size and GC content of the 127 L. plantarum strains were 3.32Mb and 44.5%, respectively, with a number of plasmids ranging from 0 to 14 ( Figure 1). The average non-chromosomal genome size was 119Kb,

Lactiplantibacillus plantarum pangenome analysis
Roary analysis of L. plantarum pan-genome identified 1436 core, 414 soft core, 1858 shell and 13,203 cloud genes, respectively, out of 16,911 total genes ( Figure 3a). The large number of cloud genes implies that a large heterogeneity exists among the 127 L. plantarum strains considered, highlighting the 'open' nature of L. plantarum pan-genome ( Figure  3b). Nonetheless, we noticed that the number of new genes is progressively decreasing, proportionally to the number of genomes included in the analysis, while approximatively 30 new genes are continuously added for each additional genome after the first 100 genomes considered ( Figure 3c).
The phylogenetic tree based on orthologous genes found by Roary was compared with the one obtained from the coregenome analysis performed using Parsnp v1.5.3 (Treangen et al., 2014) (Figures 4 and 5). Both phylogenetic trees defined three main clades that showed a different strain distribution, both at qualitative and quantitative level (Table 1). Indeed, none of the strains belonging to the first clade were consistent between the two phylogenetic trees; in addition, strain distribution across the three clades was significantly different (p = 0.036). Strain distribution across the three clades differed significantly only for the ones determined by the phylogenetic tree based on orthologous genes, when stratified according to isolation source category (p < 0.018, Table 1).

Mobile genetic elements and adaptative immune systems in Lactiplantibacillus plantarum pangenome
Determining and characterizing mobile genetic elements (MGEs) are of paramount relevance when defining the F I G U R E 1 Phylogenetic tree obtained by OrthoFinder and genomic features of the 127 Lactiplantibacillus plantarum genomes. The Circos heatmap from the inside to the outside report the genome size, gene number, GC content and number of plasmids for each strain  probiotic potential for a given strain because they largely contribute to antibiotic resistance (Tait, 1993) and to horizontal gene transfer (Rankin et al., 2011). MGEs include plasmids, transposons and bacteriophages. Conversely, CRISPR-Cas elements and bacteriocins represent adaptative immune systems to protect against deadly consequences from MGEs or competing bacteria (Cotter et al., 2005;Klaenhammer, 1993;Peters et al., 2017).
Bacteriophage identification by PHASTER (Arndt et al., 2016) showed that the sequences of bacteriophage origin varied from 35Kb (strain SRCM101511) to 300 kb (strain DF), that is, about 1-8% of the size of the L. plantarum genomes. Bacteriophage proteins (DNA packaging protein, holin protein, lysin, tail, capsid, protease, terminase and integrase) and hypothetical proteins were the most frequent ones. The bacteriophages most encountered were Sha1 and Phig1, both isolated from L. plantarum. It suggests a high gene transfer rate between the strains. Table S3 shows in detail all the results from PHASTER (Arndt et al., 2016).
The results of bacteriocin identification/annotation are shown in Figure 6. All the strains harboured at least one bacteriocin gene, especially of Plantaricin -A, -K, -J, -N, -E and -F classes. Notably, only L. plantarum Q7 strain has Pediocin PA-1, the most extensively studied class Ila (or pediocin

| 597
L. PLANTARUM PAN-GENOME ANALYSIS family) bacteriocin, which shows a particularly strong activity against Listeria monocytogenes, a foodborne pathogen of special concern among the food industries (Rodríguez et al., 2002). A total of 101 L. plantarum strains carried at least one CRISPRs array or one Cas cluster (Table S4). The number of CRISPR arrays varied from five to one, with ATG-K6 and DSM_16365 strains carrying the highest number of CRISPRs arrays each (N = 5), along with two different Cas cluster types, CAS-TypeIIA and CAS-TypeIE, respectively. Conversely, two strains, SRCM103472 and TMW 1.1478, displayed no CRISPRs arrays, while harbouring a CAS-TypeIIA and a CAS clusters. Overall, the majority (N = 61) of the identified 101 L. plantarum strains carrying CRISPR-Cas systems components, harboured a single, small CRISPRs array (with 3 or less spacers). Similarly, only 41 strains included a single Cas cluster in their genome, except for LQ80 strain, which harboured two Cas clusters. Of the 41 strains with identified Cas clusters, 35 had CAS-TypeIIA, whereas the remaining ones had CAS-TypeIE (N = 2), CAS (N = 2) and CAS-TypeIA.
Only the L. plantarum 12_3 strain was identified as carrier of acquired antibiotic resistance genes, with the presence of an ANT(6) gene, producing an aminoglycoside antibiotic.

'Probiotic marker genes' in L. plantarum pangenome
A probiotic bacterium should have the ability to survive, and transiently persist, in the gastrointestinal tract where has to be able to exert a beneficial effect. Apart from MGEs and adaptative immune systems genes, the ability to resist host stressful conditions and to hydrolyse bile salts is of paramount relevance when determining the key genes defining a candidate bacterium with potential probiotic potential. Lebeer et al., (2008) provided a comprehensive summary of Lactobacillus genes involved in stress resistance, active metabolism in the host, adhesion and putative probiotic functions. Combining their list with the results obtained from more recent papers, we created an updated list of 'probiotic marker genes' (PMGs) responsible for stress resistance (acid, osmotic, oxidative, temperature), bile salt hydrolase activity, adhesion ability and gut persistence to find out any unique features across all the L. plantarum strains included in the present study. The annotation and the presence/absence status of the 75 identified PMGs are reported as Table S5.

DISCUSSION
This study provides a comprehensive pan-genome analysis of L. plantarum, including the largest number (N = 127) of complete L. plantarum genomes retrieved from publicly available repositories. Our effort aimed to determine a solid reference panel for the future characterization of newly sequenced L. plantarum strains useful as probiotic supplements. Indeed, we paid particular attention in depicting the probiotic potential of each strain included in the analysis, through the identification and characterization of their plasmid content, MGEs, adaptative immune system and PMGs.

| 601
L. PLANTARUM PAN-GENOME ANALYSIS Moreover, the dissection of L. plantarum pan-genome into the four different gene categories ('core', 'soft core', 'shell' and 'cloud') will facilitate genetic engineering strategies for genomic reduction/optimization. Furthermore, our results showed that phylogenetic tree analyses represent a powerful methodology to predict potential outliers and to elucidate the real isolation source of the strains by helping to address misannotation and cross-contamination issues.
Understanding the origin of isolation of each strain and their niche-specific adaptation can be of particular relevance for their further applications to improve probiotic efficacy and industrial workhorses.
Several important features separate our work from previous studies looking at the pan-genome or for general comparative genomic analysis of L. plantarum (Choi et al., 2018;Evanovich et al., 2019;Inglin et al., 2018).
First, and most critically, we considered only the L. plantarum strains for which a complete genome was available. Indeed, it becomes obvious that the inclusion of genomes at their draft stage in a pan-genome analysis can lead to severe biases that may compromise both data analysis and their interpretation. Not surprisingly, several recently developed tools are aiming to maximize bacterial pan-genome analyses by adopting ad-hoc strategies for the inclusion of draft F I G U R E 6 Bacteriocin identification/annotation heatmap of the 127 Lactiplantibacillus plantarum genomes. The heatmap reports the normalized scalar values of bacteriocin identification/annotation by BAGEL4 for each strain, using as lowest threshold a sequence homology of 50%