Genomic and phenotypic description of the newly isolated human species Collinsella bouchesdurhonensis sp. nov.

Abstract Using culturomics, a recently developed strategy based on diversified culture conditions for the isolation of previously uncultured bacteria, we isolated strain Marseille‐P3296T from a fecal sample of a healthy pygmy female. A multiphasic approach, taxono‐genomics, was used to describe the major characteristics of this anaerobic and gram‐positive bacillus that is unable to sporulate and is not motile. The genome of this bacterium is 1,878,572 bp‐long with a 57.94 mol% G + C content. On the basis of these characteristics and after comparison with its closest phylogenetic neighbors, we are confident that strain Marseille‐P3296T (=CCUG 70328 = CSUR P3296) is the type strain of a novel species for which we propose the name Collinsella bouchesdurhonensis sp. nov.


| INTRODUCTION
Over the past decade, metagenomics has been extensively adopted to enhance the description of the human gut microbiota (Gill et al., 2006; Ley, Turnbaugh, Klein, & Gordon, 2006; Ley et al., 2005). However, many detected DNA sequences cannot be assigned to as-yet cultured microorganisms, suggesting that a significant part of the human gut microbiota remains to be isolated and described (Rinke et al., 2013). Thus, the ability to culture and obtain pure bacterial colonies is essential for better description, analysis, and correlation with health and disease (Lagier et al., 2015). Stool samples are recognized as a representative model of the gut microbiota (Raoult & Henrissat, 2014). To improve bacterial isolation from these specimens, a strategy named culturomics was recently developed (Lagier, Armougom, Million, et al., 2012; Lagier, Armougom, Mishra, et al., 2012; Lagier, El Karkouri, et al., 2012; Mishra, Lagier, Robert, Raoult, & Fournier, 2012; Seng et al., 2009). This method relies on culturing samples using diversified combinations of culture media, temperatures, atmospheres, and incubation times. All colonies isolated from a specimen are identified using matrix-assisted laser desorption/ionization-time of flight mass spectrometry (MALDI-TOF MS). When MALDI-TOF MS fails to identify a colony, 16S rRNA gene amplification and sequencing are systematically performed for further phylogenetic analysis with closely related bacterial species (Lagier, Armougom, Mishra, et al., 2012; Lagier, El Karkouri, et al., 2012; Mishra et al., 2012; Seng et al., 2009). Prior to the development of culturomics, only 688 bacteria were reported to have been isolated from the human gut. To date, culturomics has permitted the isolation of more than 1,000 distinct human gut bacterial species, including a significant number of novel species.
We report here the isolation by culturomics of strain Marseille-P3296T from a human stool sample. We believe it to be the representative strain of a new Collinsella species, for which we propose the name Collinsella bouchesdurhonensis (C. bouchesdurhonensis) sp. nov. The Collinsella genus was first described by Kageyama, Benno, and Nakase (1999), after the reclassification of Eubacterium aerofaciens into a new genus on the basis of a high 16S rRNA gene sequence divergence from other members of the Eubacterium genus. In addition to the type species C. aerofaciens (Kageyama et al., 1999), the Collinsella genus currently contains C. intestinalis (Kageyama & Benno, 2000), C.
Members of the Collinsella genus are gram-positive anaerobic bacilli from the human gut microbiota that have been suggested to play a role in health and in diseases such as rheumatoid arthritis (Chen et al., 2016) and irritable bowel syndrome (Lee & Bak, 2011). They have also been isolated from patients with Crohn's disease, ulcerative colitis, and colon cancer (Whitman et al., 2012).

| Ethics and sample collection
The stool donor was a healthy 50-year-old pygmy female from Congo, who gave informed and signed consent. Samples were stored at the URMITE laboratory (Marseille, France) at −80°C for further analysis. The culturomics study was approved by the ethics committee of the Institut Fédératif de Recherche 48 under number 09-022.

| Strain isolation
After being diluted with phosphate-buffered saline (Life Technologies, Carlsbad, CA), stool samples were incubated in an anaerobic blood culture bottle (BD BACTEC®, Plus Anaerobic/F Media, Le Pont de Claix, France) supplemented with 5% sheep blood and 5% sterile-filtered rumen at 37°C. Following initial growth in the blood culture vial, the bacterium was subcultured on 5% sheep blood-enriched Columbia agar (bioMérieux, Marcy l'Etoile, France).

| Colony identification
Colony identification was performed by means of MALDI-TOF MS as previously described (Elsawi et al., 2017; Lagier, El Karkouri, Nguyen, Armougom, Raoult, & Fournier, 2012). For strains that were not identified, 16S rRNA gene amplification and sequencing were performed as previously described (Morel et al., 2015). Sequences were then assembled and edited using the CodonCode Aligner software (http://www.codoncode.com), and a comparative analysis with the sequences of closely related species was performed using the BLAST software (http://blast.ncbi.nlm.nih.gov.gate1.inist.fr/Blast.cgi). A strain was considered to represent a new species when the 16S rRNA gene sequence similarity between the strain under study and its closest phylogenetic species with standing in nomenclature was below 98.65%, and a new genus when this similarity was below 95% (Kim, Oh, Park, & Chun, 2014). The generated mass spectrum and 16S rRNA gene sequence were added to the UR-MS and the EMBL-EBI databases, respectively.
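The decision rule above can be sketched as a small function. This is only an illustration of the two thresholds cited from Kim et al. (2014); the function name, the returned labels, and the example similarity values are hypothetical and are not results from this study.

```python
# Thresholds quoted in the text (Kim et al., 2014), in % 16S rRNA gene similarity.
SPECIES_THRESHOLD = 98.65
GENUS_THRESHOLD = 95.0


def classify_by_16s_similarity(similarity_percent: float) -> str:
    """Classify a strain from its similarity to the closest validly named species.

    Hypothetical helper: the labels returned are illustrative only.
    """
    if not 0.0 <= similarity_percent <= 100.0:
        raise ValueError("similarity must be a percentage between 0 and 100")
    if similarity_percent < GENUS_THRESHOLD:
        return "putative new genus"
    if similarity_percent < SPECIES_THRESHOLD:
        return "putative new species"
    return "known species"


# Hypothetical similarity values, chosen only to exercise each branch.
for value in (99.2, 97.0, 93.5):
    print(value, classify_by_16s_similarity(value))
```

Checking the genus threshold first matters, because a similarity below 95% is also below 98.65%.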

| Morphological and biochemical assays
The biochemical characteristics of strain Marseille-P3296T were determined using multiple API strips (ZYM, 20A and 50CH, bioMérieux), with bacteria that had been cultivated on 5% sheep blood-enriched Columbia agar in an anaerobic atmosphere at 37°C for 24 hr.
In addition, the sporulation ability was tested by exposing a bacterial suspension to a thermal shock (80°C) for 20 min. Gram staining and cell motility were observed at 100X magnification using a DM1000 photonic microscope (Leica Microsystems, Nanterre, France). Finally, cell morphology was observed with a Tecnai G20 Cryo (FEI) transmission electron microscope as previously described (Elsawi et al., 2017).

| FAME analysis
Cellular fatty acid methyl ester (FAME) analysis was performed by GC/MS. Two cell samples were prepared, each containing approximately 15 mg of bacterial biomass per tube, and analyzed as previously described. FAMEs were separated on an Elite 5-MS column and monitored by mass spectrometry (Clarus 500-SQ 8 S, Perkin Elmer, Courtaboeuf, France). They were then identified using the FAME mass spectral database (Wiley, Chichester, UK) and MS Search 2.0 operated with the Standard Reference Database 1A (NIST, Gaithersburg, USA).

| DNA extraction and genome sequencing
Genomic DNA (gDNA) of strain Marseille-P3296T was extracted after an initial mechanical treatment with the FastPrep BIO 101 instrument (Qbiogene, Strasbourg, France) using acid-washed glass beads (Sigma) at a speed of 6.5 m/s for 90 s. Then, after a 3-hr incubation with lysozyme at 37°C, DNA was extracted using the EZ1 biorobot (Qiagen). The elution volume was set to 50 μl, and the final gDNA concentration (25.8 ng/μl) was determined using a Qubit assay with the high sensitivity kit (Life Technologies, Carlsbad, CA, USA).
The gDNA from strain Marseille-P3296T was sequenced using a MiSeq sequencer (Illumina Inc, San Diego, CA, USA) and the Mate Pair strategy as previously described (Elsawi et al., 2017). The size of the DNA fragments obtained ranged from 1.5 kb to 11 kb, with an optimal size of 7.933 kb. Additionally, 600 ng of tagmented DNA was circularized without size selection. The circularized DNA was mechanically sheared using the Covaris device S2 in T6 tubes (Covaris, Woburn, MA, USA) in order to obtain small DNA fragments with an optimal size of 981 bp.
Library profiles were visualized on a High Sensitivity Bioanalyzer LabChip (Agilent Technologies), and a final concentration of 19.28 nmol/L was measured. The libraries were normalized to 2 nmol/L and pooled. Then, following denaturation, libraries were diluted to a concentration of 15 pmol/L. Libraries were loaded onto the reagent cartridge and then onto the instrument along with the flow cell. Cluster generation was done automatically, and a single 2 × 251-bp run was performed for sequencing.
A total of 9.5 Gb of information was obtained from a cluster density of 1,050 K/mm² with a quality threshold of 92.5% (18,644,000 passing filter paired reads). The index representation of strain Marseille-P3296T in this run was 7.78%. The 1,451,051 paired reads were trimmed and then assembled.
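The run figures can be cross-checked with simple arithmetic. The sketch below recomputes the index representation from the two read counts given above and derives a rough upper bound on depth, assuming (this is an assumption, not a statement from the protocol) that every read is a full 251 bp before trimming and using the genome size reported in this article. Variable and function names are hypothetical.

```python
# Figures taken from the text.
PASSING_FILTER_PAIRS = 18_644_000   # paired reads passing filter in the run
STRAIN_PAIRS = 1_451_051            # paired reads attributed to the strain
READ_LENGTH_BP = 251                # 2 x 251-bp run
GENOME_SIZE_BP = 1_878_572          # assembled genome size reported in the article


def index_representation(strain_pairs: int, total_pairs: int) -> float:
    """Share of the run attributed to one index, as a percentage."""
    return 100.0 * strain_pairs / total_pairs


def raw_depth_upper_bound(pairs: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Depth if every read of every pair were full length (i.e. before trimming)."""
    return pairs * 2 * read_length_bp / genome_size_bp


share = index_representation(STRAIN_PAIRS, PASSING_FILTER_PAIRS)
depth = raw_depth_upper_bound(STRAIN_PAIRS, READ_LENGTH_BP, GENOME_SIZE_BP)
print(f"index representation: {share:.2f}%")
print(f"untrimmed depth upper bound: {depth:.1f}x")
```

The recomputed share rounds to the reported 7.78%. The untrimmed bound comes out slightly above the mean depth of 386 reported for the final assembly, which is the direction one would expect once reads are trimmed.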
Gaps were reduced in each assembly using the GapCloser tool (Luo et al., 2012). Then, contamination with phage PhiX was detected and eliminated. Finally, any scaffolds with a size below 800 bp or with a depth value below 25% of the mean depth were considered potential contaminants and were eliminated. The best assembly was chosen based on multiple criteria such as the number of Ns, the N50, and the number of scaffolds. For the studied strain, the Velvet software gave the best assembly, with a mean depth of coverage of 386.
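The scaffold filter described above can be sketched as follows. The two cut-offs (800 bp and 25% of the mean depth) come from the text; everything else is an assumption made for illustration: the `Scaffold` record, the function name, the example scaffolds, and in particular the use of an unweighted mean depth across scaffolds, since the text does not say how the mean was computed.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Scaffold:
    """Hypothetical record for one assembled scaffold."""
    name: str
    length_bp: int
    depth: float


def filter_scaffolds(scaffolds, min_length_bp=800, min_depth_fraction=0.25):
    """Drop scaffolds that are too short or too shallow relative to the mean depth.

    Assumption: the mean depth is the unweighted mean over all scaffolds.
    """
    if not scaffolds:
        return []
    depth_cutoff = min_depth_fraction * mean(s.depth for s in scaffolds)
    return [
        s for s in scaffolds
        if s.length_bp >= min_length_bp and s.depth >= depth_cutoff
    ]


# Hypothetical scaffolds: one good, one too short, one too shallow.
example = [
    Scaffold("s1", 50_000, 400.0),
    Scaffold("s2", 600, 380.0),
    Scaffold("s3", 3_000, 50.0),
]
kept = filter_scaffolds(example)
print([s.name for s in kept])
```

A length-weighted mean depth would be an equally plausible reading of the text and would shift the depth cut-off.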

| Genome annotation
The Prodigal software (Hyatt et al., 2010) was used with default parameters to detect Open Reading Frames (ORFs), and any ORFs spanning a sequencing gap were excluded.
Bacterial protein sequences were searched against the Clusters of Orthologous Groups (COG) database using BLASTP as previously described (Elsawi et al., 2017). Transfer RNA genes were detected using the tRNAScanSE software (Lowe & Eddy, 1997), and ribosomal RNA genes using the RNAmmer tool (Lagesen et al., 2007). Phobius (Käll, Krogh, & Sonnhammer, 2004) was used to detect lipoprotein signal peptides and transmembrane helices. ORFans were identified based on the BLASTP results as previously described (Elsawi et al., 2017).

| 16S rRNA phylogenetic analysis
Sequences of the strains considered in the phylogenetic analyses were obtained by a BLASTn search against the 16S rRNA database of "The All-Species Living Tree" Project of Silva (Yilmaz et al., 2014). Sequences were aligned with CLUSTALW (Thompson, Higgins, Gibson, & Gibson, 1994), and phylogenetic inferences were generated with the MEGA software (Kumar, Tamura, & Nei, 1994) using the maximum-likelihood method.

| Genome comparison analysis
The ORFeome, proteome, and complete sequences of the species used for the comparative analyses were retrieved from GenBank. The species themselves were automatically recovered from the 16S rRNA tree using Phylopattern (Gouret, Thompson, & Pontarotti, 2009). If no genome was available for a specific strain, a complete genome from another strain of the same species was used. If the ORFeome and proteome were not predicted, Prodigal was used with default parameters to predict them from the genome sequence. ProteinOrtho was used for proteome analyses (Lechner et al., 2011). Then, AGIOS similarity scores were obtained with the MAGI software for each pair of genomes (Ramasamy et al., 2014). These scores represent the mean nucleotide similarity between all pairs of orthologous genes of the selected genomes. Proteome annotation was also performed to determine the functional classes of predicted genes based on the clusters of orthologous groups of proteins (as was done for genome annotation).
The comparison protocol was run with DAGOBAH (Gouret et al., 2011), which provided pipeline analysis, and with Phylopattern (Gouret et al., 2009) for tree manipulation.
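To make the definition of the AGIOS score concrete, the sketch below averages per-gene nucleotide identity over a set of orthologous gene pairs. It is a simplified illustration, not the MAGI implementation: it assumes the orthologous pairs have already been identified and aligned to equal length (with "-" for gaps), and the function names and example sequences are hypothetical.

```python
from statistics import mean


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gap columns are skipped (an assumption of this sketch)
        compared += 1
        matches += a == b
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def agios_like_score(ortholog_alignments) -> float:
    """Mean nucleotide identity across all orthologous gene pairs of two genomes."""
    return mean(percent_identity(a, b) for a, b in ortholog_alignments)


# Hypothetical toy alignments for two orthologous gene pairs.
pairs = [("ATGC", "ATGA"), ("GGCC", "GGCC")]
score = agios_like_score(pairs)
print(score)
```

Each gene pair contributes equally to the mean here; weighting by gene length would be a different, equally defensible choice that the text does not settle.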

| Strain identification
MALDI-TOF MS could not identify strain Marseille-P3296T. Hence, its type spectrum (Figure 1a) was added to the URMS database and compared to spectra from other closely related species (Figure 1b). The
Cells were also gelatin-positive, but β-glucosidase-negative. A comparison of phenotypic and biochemical features between the compared species is presented in Table 2. Compared with the other studied species, strain Marseille-P3296T differed in a combination of production of N-acetyl-β-glucosamine, absence of β-galactosidase activity, and acidification of D-mannitol. In contrast, all compared Collinsella species were Gram-positive bacilli that were unable to sporulate and neither exhibited urease activity nor produced acid from L-

Cell shape: Bacillus
Sporulation: Negative

| Genome characteristics of strain Marseille-P3296T
The genome of strain Marseille-P3296T is 1,878,572 bp long with a 57.94 mol% G + C content. It is made of 15 scaffolds (for a total of 24 contigs). Of the 1,711 predicted genes, 51 are RNAs (1 16S rRNA, 1 5S rRNA, 48 tRNA genes, and 1 23S rRNA) and 1,660 are protein-coding genes. Fifty-two genes are detected as ORFans (3.17%) and 1,348 genes (82.15%) are assigned a putative function (by BLAST against nr or COGs). The remaining 193 genes (11.76%) are defined as hypothetical proteins (Table S2). A graphical representation of the genome of strain Marseille-P3296T is presented in Figure S3. Table S4 details the distribution of genes into COG functional categories.
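For readers unfamiliar with how a G + C content figure is derived from a draft assembly, a minimal sketch follows. It pools all scaffolds and ignores ambiguous bases such as N, which is an assumption of this sketch rather than a documented detail of the pipeline used here; the function name and the toy sequences are hypothetical.

```python
def gc_content_percent(sequences) -> float:
    """G + C percentage over all unambiguous bases in a set of scaffold sequences."""
    gc = 0
    at = 0
    for seq in sequences:
        upper = seq.upper()
        gc += upper.count("G") + upper.count("C")
        at += upper.count("A") + upper.count("T")
    if gc + at == 0:
        raise ValueError("no unambiguous bases found")
    return 100.0 * gc / (gc + at)


# Hypothetical toy scaffolds, not real sequence data.
toy_gc = gc_content_percent(["GGCC", "AATT"])
print(toy_gc)
```

Applied to the scaffold sequences of an assembly, the same function would return a single pooled percentage for the whole genome.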

| Comparative analysis between the genomes of strain Marseille-P3296T and closely related species
The draft genome sequence of strain Marseille-P3296T was compared to those of Collinsella aerofaciens (C. aerofaciens) (AAVN00000000), C.
The gene content of strain Marseille-P3296T (1,711) is smaller than those of C. massiliensis, C. aerofaciens, C. intestinalis, C. tanakaei, C. stercoris, O. uli, and C. glomerans (2,046, 2,222, 1,626, 2,258, 2,106, 1,827, and 1,859, respectively), but larger than those of A. minutum, A. rimae, and A. parvulum (1,593, 1,523, and 1,411, respectively). The distribution of functional classes of predicted genes of strain Marseille-P3296T according to the COGs database is presented in Figure S4. The distribution was similar in all studied genomes.

| CONCLUSION
On the basis of genomic, phenotypic, and biochemical features, we suggest the creation of a new species, Collinsella bouchesdurhonensis sp. nov. Strain Marseille-P3296T is the type strain of C. bouchesdurhonensis sp. nov.
Strain Marseille-P3296T is a strictly anaerobic bacterium, able to grow at temperatures between 37°C and 42°C, with optimal growth at 37°C. This strain can sustain a pH range of 6-8.5 and up to 5% NaCl concentration.
The major fatty acid is 9-octadecenoic acid.
The genome is 1,878,572 bp long with a 57.94 mol% G + C content. The 16S rRNA and genome sequences of C. bouchesdurhonensis sp. nov. are deposited in EMBL-EBI under accession numbers LT623900 and FTLD00000000, respectively. The type strain is Marseille-P3296T (=CSUR P3296 = CCUG 70328) and was isolated from the stool sample of a healthy 50-year-old pygmy woman from Congo.

CONFLICTS OF INTEREST
No conflicts of interest are to be declared.