Bioinformatic survey of CRISPR loci across 15 Serratia species

Abstract The Clustered Regularly Interspaced Short Palindromic Repeats and CRISPR‐associated proteins (CRISPR–Cas) system of prokaryotes is an adaptative immune defense mechanism to protect themselves from invading genetic elements (e.g., phages and plasmids). Studies that describe the genetic organization of these prokaryotic systems have mainly reported on the Enterobacteriaceae family (now reorganized within the order of Enterobacterales). For some genera, data on CRISPR–Cas systems remain poor, as in the case of Serratia (now part of the Yersiniaceae family) where data are limited to a few genomes of the species marcescens. This study describes the detection, in silico, of CRISPR loci in 146 Serratia complete genomes and 336 high‐quality assemblies available for the species ficaria, fonticola, grimesii, inhibens, liquefaciens, marcescens, nematodiphila, odorifera, oryzae, plymuthica, proteomaculans, quinivorans, rubidaea, symbiotica, and ureilytica. Apart from subtypes I‐E and I‐F1 which had previously been identified in marcescens, we report that of I‐C and the I‐E unique locus 1, I‐E*, and I‐F1 unique locus 1. Analysis of the genomic contexts for CRISPR loci revealed mdtN‐phnP as the region mostly shared (grimesii, inhibens, marcescens, nematodiphila, plymuthica, rubidaea, and Serratia sp.). Three new contexts detected in genomes of rubidaea and fonticola (puu genes‐mnmA) and rubidaea (osmE‐soxG and ampC‐yebZ) were also found. The plasmid and/or phage origin of spacers was also established.

The genus Serratia, a Gram-negative rod, is now part of the family Yersiniaceae. Serratia species can be found in different environments (e.g., water, soil) and hosts (e.g., humans, insects, plants, vertebrates) where they may play different roles ranging from opportunistic pathogens to symbionts (Cristina et al., 2019;Gupta et al., 2021;Lo et al., 2016). Among Serratia species, marcescens is undoubtedly the most studied mainly for its role played as a symbiont associated with insects and nematodes  or as a human opportunistic pathogen (currently reported as one of the most important bacteria responsible for acquired hospital infections such as bacteremia, pneumonia, intravenous catheter-associated infections, and endocarditis) (Ferreira et al., 2020). Other Serratia species responsible (to a minor extent) for human bacteremia are liquefaciens and odorifera (Mahlen, 2011). A growing number of marcescens genomes have then been sequenced with a pangenome allele database available for different studies ranging from virulence and antibiotic resistance to the identification of CRISPR systems (Abreo & Altier, 2019). A number of studies, in addition to marcescens, have also been reported for other Serratia species that play different roles in human and insect pathogenesis (Petersen & Tisa, 2013). Although the characterization of CRISPR systems represents a valuable substrate for diagnostic, epidemiologic, and evolutionary analyses (Louwen et al., 2014), data on CRISPR-Cas systems in the genus are scarce and limited to the detection of subtypes I-E and I-F1 in genomes of the species marcescens (Medina-Aparicio et al., 2018;Scrascia et al., 2019;Srinivasan & Rajamohan, 2019;Vicente et al., 2016).
In this study, 146 Serratia complete genomes and 336 highquality assemblies are available for the species ficaria, fonticola, grimesii, inhibens, liquefaciens, marcescens, nematodiphila, odorifera, oryzae, plymuthica, proteomaculans, quinivorans, rubidaea, symbiotica, and ureilytica were explored for the presence and type of cas gene clusters and/or CRISPRs. Apart from subtypes I-E and I-F1, the study showed the presence (first detected in Serratia) of subtype I-C, the presence of unique loci, and detailed genomic contexts of CRISPR loci. The plasmid and/or phage origin of spacers was also assessed.
The discovery of CRISPR-Cas systems has allowed the development of new technology tools in the bioengineering field (Dong et al., 2021). A clear example is represented by gene editing strategies based on CRISPR/Cas9 technique successfully used in agriculture, nutrition, and human health (Nidhi et al., 2021). The development of new CRISPR-based applications also relies on the continuous update of CRISPR-Cas systems data and knowledge. Our study, in providing more comprehensive data on CRISPR loci in Serratia, has undoubtedly contributed to an expanded knowledge of these systems.

| Genomes analyzed
One hundred and forty-six Serratia complete genomes were considered in this study. The set of genomes encompasses the 15 S. marcescens complete genomes we previously analyzed (Scrascia et al., 2019) and those of the genus Serratia available at the CRISPR-Cas ++ database (https://crisprcas.i2bc.paris-saclay.fr/ MainDb/StrainList) up to December 12, 2020 (Couvin et al., 2018;Pourcel et al., 2020) (Supporting Information: Table S1). Among genome sequences available at the assembly level of scaffolds or contigs available at the National Center for Biotechnology Information database (NCBI) (https://www.ncbi.nlm.nih.gov/assembly) up to December 12, 2020, we selected the high-quality assemblies (N50 > 50 kb, i.e. 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than the 50 kb) that have been included in the study.

| Detection of CRISPR-Cas loci
Details about the detection of a cas gene cluster with associated arrays (CRISPR-Cas system) and CRISPR arrays only for complete genomes were retrieved from the CRISPR-Cas ++ database. CRISPR arrays recorded by CRISPR-Cas ++ were assigned to Levels 1-4 based on the criteria required to select the minimal structure of putative CRISPR as reported by Pourcel et al. (2020). Level 1 is the lowest level of confidence. Levels 2-4 were assigned based on the conservation of repeats (which must be high in a real CRISPR) and on the similarity of spacers (it must be low). Level 4 CRISPRs were defined as the most reliable ones. Levels 1-3 may correspond to false CRISPRs. In our study, only CRISPRs recorded with Level 4, were considered. CRISPRs without a set of cas genes in the host genome were defined as "orphans." Genomes harboring cas gene clusters were then submitted to the CRISPRone analysis suite (http://omics. informatics.indiana.edu/CRISPRone/) (Zhang & Ye, 2017) to graphically visualize the architecture of each cluster. The same suite was used to search and visualize cas gene clusters in the high-quality assemblies. A subtype of cas gene clusters was assigned according to the recent classification update for CRISPR-Cas systems (Makarova et al., 2020).

| In silico analyses of consensus of direct repeats
A consensus of direct repeats from CRISPRs was clustered by BLAST similarity. Some consensus DRs were manually trimmed when just a few terminal nucleotides were the only difference from the other members of the same cluster. The consensus DRs were used as input for CRISPRBank (http://crispr.otago.ac.nz/CRISPRBank/index.html) and CRISPR-Cas ++ to assign, based on identity with known consensus DRs (Biswas et al., 2016;Couvin et al., 2018;Pourcel et al., 2020), a specific CDR type to CRISPR. The CRISPRs whose CDR type was consistent with the subtype of the cas gene set harbored in the same genome were defined as "canonical." While those not consistent with the subtype of the cas gene set harbored in the same genome were defined as "alien." A schematic diagram of alien, canonical and orphan arrays is shown in Figure 1. consensus DRs and the number of repeats of the CRISPRs in the high-quality assemblies of Serratia sp. strains DD3, Ag1, and Ag2 were recovered from the CRISPRone output. Spacers' analysis for duplications (spacers of Ag1, Ag2, and DD3 included) was performed through the CRISPRCasdb spacer database at the CRISPRCas ++ site (https:// crisprcas.i2bc.paris-saclay.fr/MainDbQry/Index). Phagic and/or plasmidic origin of matching protospacers were searched at the CRISPRTarget site (http://crispr.otago.ac.nz/CRISPRTarget/crispr_ analysis.html) (Biswas et al., 2016).

| Genomic contexts of CRISPR-positive genomes
Analysis of CRISPR-positive complete genomes and high-quality assemblies was performed to better characterize the genomic context surrounding the cas gene sets and/or CRISPR arrays. Highquality assemblies with at least 4 kb flanking the cas gene sets were considered. These regions were annotated by Prokka (https://github. com/tseemann/prokka) (Seemann, 2014). Synteny was established by either the Mauve algorithm (http://darlinglab.org/mauve/mauve. html) (Darling et al., 2010) or visual inspection of annotated proteins.

| Phylogenetic analyses
The evolutionary relationship of Serratia strains found positive for cas genes sets was established and graphically depicted by the Cas3 sequence tree. All the protein sequences were aligned by the MUSCLE algorithm (https://www.ebi.ac.uk/Tools/msa/muscle/) (Edgar, 2004a(Edgar, , 2004b. The 16S rRNA gene tree was also drawn for comparison. Dendrograms were generated by the Neighbor-Joining clustering method and average distance trees with JalView (https:// www.jalview.org/) (Waterhouse et al., 2009). For the 16S rRNA gene F I G U R E 1 Schematic diagram of the three categories of arrays described in the study. DRs and spacers are depicted with diamonds and rectangles respectively. cas genes are shown as arrows pointing in the direction of transcription. The yellow color highlights the consistency between the DR type and the cas subtype; while the blue color indicates inconsistency. tree, the multiple sequence alignment was obtained by retrieving from one to seven full gene sequences (complete genomes) or truncated 16S rRNA gene sequences (high-quality assemblies). A phylogenetic tree was obtained by multiple alignment of all retrieved 16S rRNA genes; an abbreviated tree was constructed by using one sequence from each genome.
The I-E unique locus 1 was detected in two genomes of marcescens, the I-F1 unique locus 1 in eight genomes of marcescens, and one of grimesii. In three genomes of Serratia sp. (strains Ag1, Ag2, and DD3) an additional unique locus of the subtype I-E, identical to I-E* previously reported by Shen et al. (2017), was detected ( Figure 2b).
The locus I-E* identified in this study was characterized by the translocation of cas6e between cas7 and cas11, and the presence (upstream of cas3) of a gene harboring the WYL domain which encodes for a potential functional partner of the CARF (CRISPR-Cas Associated Rossmann Fold) superfamily proteins (Makarova et al., 2020). Proteins containing the WYL domain (name standing for the three conserved amino acids tryptophan, tyrosine, and leucine, respectively) have only been reported for subtypes I-D and VI-D (Makarova et al., 2014. The distribution of CRISPRpositive genomes, over the total analyzed, among Serratia species is shown in Figure 3. Coexistence in the same genome of different sets of cas genes was also detected: subtypes I-E and I-F1 were found in the single HQA of oryzae, while I-E* and I-F1were detected in two high-quality assemblies of Serratia sp. (strains Ag1 and Ag2) (Table A1).

| Consensus DRs and spacers
The 35 CRISPR-positive complete genomes harbored 78 CRISPRs of which 48 were canonical. The latter were distributed as follows: fonticola (4), inhibens (1), marcescens (19), plymuthica (5), rubidaea (15), and Serratia sp. (4). Twenty-three arrays were orphans and detected in genomes of and Serratia sp. (4) (Table 1; Figure 1). Alien arrays (8) were only detected in the species rubidaea. For a comprehensive analysis, arrays in the three high-quality assemblies Ag1, Ag2, and DD3 were included (Table A1). All disclosed CRISPRs were assigned, by comparative sequence analyses, to consensus DR types I-C, I-E, or I-F (Table 1). The association between consensus DR types and cas gene sets (canonical and unique loci) is reported in Table 2. Based on their nucleotide identity, the consensus DRs identified for subtype I-E and its unique loci (I-E* and unique locus 1) could be arranged into two clusters named consensus DR-I and consensus DR-II. consensus DR-I was composed of 6 consensus DRs (identity from 83% to 96%) and linked to the cas gene sets I-E and I-E unique locus 1. consensus DR-II was composed of 2 consensus DRs (identity of about 96%) and linked to the cas gene set I-E*. When the consensus DRs of the two clusters were compared to each other, the nucleotide identity dropped to 55%-62%.   The position of strains TEL in the cluster marcescens and JUb9 in the cluster rubidaea shown in the Cas3 phylogenetic tree was confirmed by the 16S rRNA gene tree, which might suggest a species assignment for these strains.

| CRISPR genomic contexts
The 35 CRISPR-positive complete genomes and 28 of the 46 CRISPR-positive high-quality assemblies were analyzed to identify Note: Palindrome identified in each consensus DR is underlined.
Abbreviation: CDR, consensus DR. a Consensus DR-I group.
b Consensus DR associated with the 20DRs array in Ag1 strain, the 3DRs array in Ag2 strain and the DD3 arrays (Table A1) | 9 of 22 assignment to rubidaea was assumed for the strain JUb9 (see above).
Detection of subtype I-C is the first report in Serratia. The prevalence of the subtype I-F1 in our subset of CRISPR-positive genomes is consistent with both the new reorganized Enterobacterales order (Adeolu et al., 2016) and data produced by Medina-Aparicio et al. This study also supplies data on the presence/number of CRISPRs and their consensus DRs sequences in Serratia. Apart from canonical arrays (61.5% of the total disclosed arrays), orphans (29.4%) and aliens (10.2%) arrays were also detected (Table 1; Figure 1).
Orphan arrays might represent remnants of previous complete CRISPR-Cas systems (Zhang & Ye, 2017). The presence of alien arrays found only in rubidaea complete genomes is, as far as we know, the first report in bacteria CRISPR-positive genomes. Its detection might be explained as traces of ancient complete CRISPR-Cas systems I-E/I-F1 or I-C/I-E/I-F1 coexistent within the same genome (Table 1). Alternatively, the aliens might result from single horizontal gene transfer events. Further analyses could unveil their genetic origin and the entity of their distribution among CRISPR-positive bacteria genomes. Detection of more alien arrays might unveil that the presence of multiple subtypes in a genome is more frequent than it has been reported so far. Furthermore, consensus DRs specifically associated with the cas gene set I-E* were also first described (Table 2).
Finally, the phylogenetic tree generated by multiple alignment of the Cas3 sequences showed a potential sub-lineage (I-E*) within the I-E branch and thus might represent and/or anticipate a distinct clonal expansion of an I-E sub-population (Figure 4).
Knowledge of CRISPR-Cas systems is constantly expanding due to studies on newly available genomic sequences or genomic sequences

ACKNOWLEDGMENTS
We would like to thank Karen Laxton and Julian Laurence for their writing assistance. There are no funding agencies to report for this article.

CONFLICT OF INTEREST
None declared.

DATA AVAILABILITY STATEMENT
All data supporting the findings of this study are available within the article (Appendix) and its Supporting Information files (Supporting Information: Table S1: List of Serratia genome assemblies; Supporting Information: Table S2: Spacer analyses; Supporting Information: Figure S1:    Possible multiple records of the same genome. Spacers' sequences were identical.