Genome‐scale metabolic modeling reveals key features of a minimal gene set

Abstract Mesoplasma florum, a fast‐growing near‐minimal organism, is a compelling model to explore rational genome designs. Using sequence and structural homology, the set of metabolic functions its genome encodes was identified, allowing the reconstruction of a metabolic network representing ~30% of its protein‐coding genes. Growth medium simplification enabled substrate uptake and product secretion rate quantification which, along with experimental biomass composition, were integrated as species‐specific constraints to produce the functional iJL208 genome‐scale model (GEM) of metabolism. Genome‐wide expression and essentiality datasets as well as growth data on various carbohydrates were used to validate and refine iJL208. Discrepancies between model predictions and observations were mechanistically explained using protein structures and network analysis. iJL208 was also used to propose an in silico reduced genome. Comparing this prediction to the minimal cell JCVI‐syn3.0 and its parent JCVI‐syn1.0 revealed key features of a minimal gene set. iJL208 is a stepping‐stone toward model‐driven whole‐genome engineering.


Identification of protein molecular functions in M. florum L1
Mesoplasma florum L1 is a near-minimal bacterium that was originally isolated from a lemon tree flower (McCoy et al, 1984). This microorganism belongs to the Mollicutes class, a group of small wall-less bacteria with genome sizes varying from ~560 kb in the case of Mycoplasma genitalium (Fraser et al, 1995) to more than a million base pairs (~1.5 Mbp) for Acholeplasma laidlawii (Lazarev et al, 2011) (Appendix Table S1). Due to their very small genomes, many Mollicutes represent interesting candidates for the study of the minimal components capable of sustaining cellular life (Morowitz, 1984). In that context, multiple studies have provided a general understanding of the metabolism of these near-minimal cells (Miles, 1992; Pollack & Williams, 1996). While this understanding is useful and has previously been leveraged for genome-scale metabolic reconstructions of different Mollicute species (Suthers et al, 2009; Wodke et al, 2013; Bautista et al, 2013), studies specifically interested in M. florum are rare. Hence, we set out to extract a maximal amount of functional information from the M. florum genome by (1) comparing the proteome of M. florum with predicted proteins encoded by Mollicutes for which genome-scale models (GEM) have been published, (2) extracting enzyme commission (EC) numbers, and (3) generating three-dimensional structures through homology modelling.
Gene names are useful to query public databases for putative functions but were initially scarcely florum to reactions in the models. The gene to reaction mapping obtained for each model was converted to BiGG identifiers (King et al, 2016) using MetaNetX (Moretti et al, 2016). This draft was used to initiate the metabolic reconstruction process. Identified reactions were added using the SimPheny software (Schilling et al) and later extracted to be used with the COBRApy toolbox (Ebrahim et al, 2013).
identifications found by DETECT v2 were identical with at least one method. The remaining 13 identifications shared the first three EC digits with at least one other method. Of the 186 identifications found with a single method, 164 were specific to COFACTOR, 20 were specific to PATRIC and 2 were lower quality hits found only by DETECT v2. The consistency between EC number predictions was compared by matching the EC digits obtained with each method, showing that more than 87% have identical (113) or similar (67) EC digits (Figure 1E, Materials and Methods).
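The digit-matching comparison can be illustrated with a short sketch. It assumes that "identical" means all four EC fields agree and that "similar" means the first three fields agree; the function names and the example predictions are illustrative and are not taken from the pipeline used in this study.

```python
from itertools import combinations

# Ranking of agreement levels, from weakest to strongest.
_RANK = {"different": 0, "similar": 1, "identical": 2}


def ec_fields(ec):
    """Split an EC number such as '6.3.4.21' into its four fields."""
    fields = ec.strip().split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four EC fields, got {ec!r}")
    return fields


def compare_ec(ec_a, ec_b):
    """Classify the agreement between two EC numbers."""
    a, b = ec_fields(ec_a), ec_fields(ec_b)
    if a == b:
        return "identical"
    if a[:3] == b[:3]:  # assumption: 'similar' = same first three fields
        return "similar"
    return "different"


def best_agreement(predictions):
    """Return the strongest pairwise agreement among methods for one protein.

    predictions maps a method name to the EC number it predicted.
    A protein predicted by a single method is reported as 'single'.
    """
    if len(predictions) < 2:
        return "single"
    levels = [compare_ec(a, b) for a, b in combinations(predictions.values(), 2)]
    return max(levels, key=_RANK.__getitem__)


# Illustrative predictions for one protein (hypothetical input).
calls = {"PATRIC": "6.3.4.21", "DETECT": "6.3.4.21", "COFACTOR": "2.4.2.11"}
print(best_agreement(calls))
print(compare_ec("2.7.6.2", "2.7.6.3"))
```

Taking the strongest pairwise agreement per protein is one possible convention; a stricter scheme could instead require agreement across all methods.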
Our study is not the first to use ad hoc reconstruction of 3D structures for genome-wide identification of protein functions (Yang & Tsui, 2018; Yang et al, 2019; Antczak et al, 2019). Of all four approaches used, COFACTOR identified the most EC numbers (Figure 1D and Dataset EV3). Its predictions were nonetheless frequently different from those of standard, sequence-based methods. Despite the potential for false-positive identifications, these predictions were useful to formulate contextual hypotheses where the metabolic network suggested the need for a given function (see Figure 6). Faced with the great challenge of identifying several molecular functions required for synthetic biology, our study demonstrated the useful application of protein structures for the generation of testable hypotheses. With the increasing reliability of structure prediction algorithms (Billings et al, 2019; AlQuraishi, 2019; Senior et al, 2019), this type of approach is likely to gain interest.

Final annotation score
Our bioinformatic analysis extracted extensive information from the genome and attributed a confidence score to each of the resulting gene-to-function associations. While a similar approach using a combination of computational methods was previously used to predict unknown molecular functions in a genome (Ghatak et al, 2019), our rationale was that the cross-validations would increase confidence levels and establish a hierarchy in the current annotation (Figure 1E).
Indeed, most of the proteins for which no gene name was identified also did not map to known structural domains or to a functional EC number. Overall, 283 and 315 proteins (~45% of the total proteins) had poor mappings to EC numbers and quality structures, respectively. Cross-validated top-tier confidence proteins were scarcer, with 156 identical gene names and 113 identical EC numbers identified through different methods, which corresponds to ~20% of the total proteins. The remaining ~35% of proteins had mixed identifications and cover the medium confidence range. To summarize the information contained in Figure 1E, a final annotation score was calculated (Materials and Methods and Figure 1F). These results demonstrate that the identification of molecular functions, even in small genomes, is far from complete. The high proportion of functions with mixed support should stimulate efforts toward experimental protein characterization (Glass et al, 2017).

Genome-scale metabolic network reconstruction
The reconstruction of the M. florum metabolic network was executed as described by Thiele and Palsson (Thiele & Palsson, 2010). To increase the reliability of the reconstruction, both GenBank and PATRIC (Wattam et al, 2017) genome annotations were used as reference annotations. The potential metabolic candidates were extracted based on EC numbers and product names. This information was used to query publicly available reaction databases (Kanehisa et al, 2017, 2016; Kanehisa & Goto, 2000; Placzek et al, 2017; Artimo et al, 2012). The identified reactions were added using the SimPheny framework to ensure charge balance and conformity with an existing functional nomenclature (Dataset EV1). The initial reconstruction was refined by curating each metabolic objective individually and studying the literature for biochemical evidence in M. florum. An interactive map of the entire reconstructed M. florum metabolic network is provided in json format as Computer Code EV1. The final iJL208 GEM is also provided in json format as Computer Code EV2.
The details of this manual reconstruction process are presented here and divided into six sections. Each section corresponds to a broader category named "Module". The utility of dividing the metabolism into such sections is to simplify future engineering tasks, a concept that was brought forth by Danchin and colleagues (Acevedo-Rocha et al, 2013; Danchin & Fang, 2016). The six modules presented here are: (1) Nucleotides, (2) Amino acids, (3) Energy, (4) Lipids, (5) Glycans, and (6) Vitamins & Cofactors (Figure 3). The details of every module composition together with the model in a spreadsheet format are available in Dataset EV4.
EC numbers and gene names identified through the computational identification of molecular functions were used to attribute reactions to the genes in the network. Along with our reasoning during the reconstruction of the model, the next sections identify genes for which the manual curation raised questions or required comments. These areas of lesser knowledge would require further experimental biochemical characterization.

Nucleotides
The synthesis of nucleotides is fundamental to all life forms. Manual curation of the M. florum genome and identification of gene names and EC numbers revealed 44 genes associated with the Nucleotides module (Dataset EV4). This module is the largest by number of reactions but contains a lower number of genes (44) than the Energy module (57) (Figures 2 and 3). This module also holds the highest number of genes involved in multiple reactions (20). In particular, the pyruvate kinase (Mfl175) is involved in 10 reactions, the largest number for any gene in the model. As in most Mollicutes, a dedicated nucleotide diphosphate kinase is absent in M. florum, and the pyruvate kinase was hypothesized to generate nucleotide triphosphates for all nucleotides.

Ribose
Ribose, which serves as a backbone for nucleotides, has an associated ABC transporter encoded  (Kanehisa et al, 2017).
The replacement EC number (6.3.4.21) was correctly identified by both PATRIC and DETECT, while COFACTOR also attributed the old EC number 2.4.2.11. Given the high confidence in the EC number attributed to that gene, we renamed it pncB to be consistent with the Salmonella typhimurium annotation from which the function was fetched (Vinitsky & Grubmeyer, 1993).

DNA uptake
Previous studies have suggested that no transporter exists for nucleosides or nucleotides in Mollicutes (Pollack, 2002). Nevertheless, the manual curation of the genome identified two gene products that could satisfy the demand for individual nucleobases. Mfl413 and Mfl658 are both identified in RefSeq as Uracil/Xanthine permeases. The structure of a uracil permease (uraA) was previously generated (Lu et al, 2011) and the authors suggested a proton symport mechanism. The associated gene in M. florum and other Mollicutes is named pyrP. While the E. coli gene seemed specific to uracil, the current annotation for Mollicutes suggests the import of both a purine (Xanthine) and a pyrimidine (Uracil). Considering that those genes are the only two associated with nucleobase import in M. florum and the potential for promiscuity of reactions catalyzed by an organism whose genome has been reduced (Seelig, 2017), we initially included the import of all nucleobases for which a phosphoribosyltransferase reaction was annotated.
We thereby identified adenine, guanine, xanthine/hypoxanthine and uracil as a first attempt at characterizing essential nucleobases for M. florum.
Another possible system worth mentioning involves the DNA uptake proteins (Mfl027 and Mfl329). As suggested before (Bizarro & Schuck, 2007; Pollack et al, 1997), in the Mollicutes' natural environment, DNA uptake may occur through the direct import of larger fragments of DNA from nearby dying cells (Pollack, 2002). In a laboratory setting, the long DNA fragments could come from undefined media components such as yeast extract (YE). These large fragments could then be digested by exonucleases. Mfl055 is a 5'-3' exonuclease that could be used for this process. While this mechanism remains hypothetical, the possibility that long DNA fragments could be degraded by membrane-associated nucleases and incorporated by a competence-related protein should be kept in mind for further biochemical characterizations.
Finally, the whole network curation identified four putative essential components necessary for M. florum growth: adenine, guanine, thymidine and ribose. Although an entire ABC transport system is annotated for ribose (Mfl666, Mfl667, Mfl668), the exact nature and function for DNA uptake would require further characterization in a completely defined medium.

Phosphorylation of nucleotides
Monophosphate-nucleotides formed by the combination of the phosphorylated ribose backbone and the imported nucleobases need to be phosphorylated twice before they can be incorporated into macromolecules (DNA and RNA). The first phosphorylation step leads to the formation of In Mollicutes, the lack of annotation for a Nucleotide Diphosphate Kinase (NDPK) is common (Bizarro & Schuck, 2007; Pollack et al, 1997). It has been hypothesized that the relaxation of the catalytic site of the glycolytic enzyme pyruvate kinase (Mfl175, pyk) would allow it to phosphorylate nucleotides other than ADP. It has been reported that the Mollicutes' pyk retains 5 to 21% of its activity when using substrates other than ADP. For modeling purposes, reactions PYK2 to PYK10 (eight reactions) were added to ensure that all nucleotide diphosphates could be converted into nucleotide triphosphates, the building blocks of DNA and RNA.
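The promiscuous pyruvate kinase reactions can be sketched as a family of stoichiometries that share phosphoenolpyruvate as the phosphate donor. The snippet below is an illustration only: the acceptor list, the BiGG-style metabolite identifiers and the `PYK_<acceptor>` reaction names are assumptions, and they do not reproduce the mapping of PYK2 to PYK10 in iJL208.

```python
def pyk_like(ndp, ntp):
    """Stoichiometry for: NDP + PEP + H+ -> NTP + pyruvate (cytosolic)."""
    return {f"{ndp}_c": -1, "pep_c": -1, "h_c": -1, f"{ntp}_c": 1, "pyr_c": 1}


# Hypothetical list of alternative phosphate acceptors (not the model's list).
ACCEPTORS = [
    ("gdp", "gtp"),
    ("udp", "utp"),
    ("cdp", "ctp"),
    ("dadp", "datp"),
    ("dgdp", "dgtp"),
]

# One reaction per acceptor, all attributed to the same enzyme (Mfl175).
promiscuous = {f"PYK_{ndp}": pyk_like(ndp, ntp) for ndp, ntp in ACCEPTORS}

for rid, stoich in promiscuous.items():
    lhs = " + ".join(m for m, c in stoich.items() if c < 0)
    rhs = " + ".join(m for m, c in stoich.items() if c > 0)
    print(f"{rid}: {lhs} -> {rhs}")
```

Associating every generated reaction with the same gene is what makes a single enzyme appear in many reactions of the network.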

Ribonucleoside-diphosphate reductase
The conversion between deoxy- and ribonucleotides is ensured by ribonucleoside-diphosphate reductases. M. florum encodes a thioredoxin (Mfl178, trx) and a thioredoxin reductase (Mfl064, ntr). This system plays an important role in oxidoreductive balance and is present in Mollicutes (Ben-Menachem et al, 1997; Pollack et al, 1997). The conversion of all four ribonucleoside diphosphates into the corresponding deoxyribonucleoside diphosphates is likely to be catalyzed in a promiscuous manner by the putative trimer complex formed by Mfl528 (nrdA or nrdE), Mfl529 (nrdI) and Mfl530 (nrdF).

Synthases
A GMP synthase (Mfl342, guaA) activity was identified through the four EC number identification methods used (EC 6.3.5.2). This enzyme converts L-glutamine to L-glutamate, consuming one xanthosine 5'-phosphate (XMP) and producing one GMP. The fact that this enzyme is kept in M. florum may indicate either the requirement for an easy conversion between amino acids in case of starvation or the need to utilize non-conventional nucleotides like XMP.
A thymidylate synthase (Mfl419, thyA) activity was identified through the four EC number identification methods used (EC 2.1.1.45). This enzyme converts dUMP into dTMP using a folate cofactor (5,10-methylenetetrahydrofolate). This mechanism is likely conserved to ensure that dUMP arising from accidental deoxidation of UMP can be re-utilized. A cytidine triphosphate synthetase (Mfl648, pyrG) activity (EC 6.3.4.2) was also identified through the four EC number identification methods used, namely RefSeq, PATRIC, DETECT v2, and COFACTOR (Datasets EV3 and EV4). This enzyme enables the production of CTP from UTP, converting glutamine into glutamate in the process. This process is likely conserved to ensure the availability of cytosine-based nucleotides and nucleosides. While DNA uptake remains unclear given the current annotation of transporters, no specific transporter for cytosine was found.

Amino acids
De novo synthesis of amino acids is generally absent from Mollicutes species (Pollack, 2002), a feature that was also observed during the manual curation of the M. florum genome. Hence, salvage of free amino acids or oligopeptides appears to be the only viable solution for M. florum to sustain protein production and growth. The possibility that both free amino acids and oligopeptides are imported in Mollicutes was previously discussed (Miles, 1992; Pollack, 2002; Yus et al, 2009). The Amino acids module is composed of two main types of transporter systems (for single amino acids and for peptides), the latter having been suggested to import small peptides directly. These peptides are digested within the cell and the resulting amino acids are used to express proteins, a process that avoids the need for any energy-expensive synthesis pathways. The apparently low number of transporters compared to the number of substrates (20 for all amino acids) has been suggested to be biochemically possible (Hosie & Poole, 2001) and to suit the genome reduction history of Mollicutes (Pollack, 2002). Eight different gene products are annotated, revealing what could compose three different systems for amino acid and oligopeptide transport (Dataset EV4).
In M. pneumoniae, no amino acid can be synthesized de novo, and the defined growth medium previously developed by Yus and colleagues provides all amino acids (Yus et al, 2009). The decision was made to include exchange reactions for each amino acid into the M. florum model.
For the import of oligopeptides, we referred to the strategy proposed for the M. genitalium model (iPS189) (Suthers et al, 2009). In iPS189, 15 dipeptide import reactions simulate the import of oligopeptides through an oligopeptide ABC transporter and 14 reactions simulate the cleavage of these dipeptides by a protease into free amino acids that can be incorporated into proteins. These 29 reactions were imported from iPS189 and the gene-reaction rule was changed so that Mfl094 to Mfl098 are associated with each of the import reactions. Eight proteases/dipeptidases are annotated in M. florum based on GenBank (Dataset EV4).
In the absence of further evidence, all proteases but two were linked to these dipeptide cleavage reactions.
The two proteases not taking part in the cleavage of imported oligopeptides are RasP/YluC (Mfl287), potentially involved in cell division, and the DNA-binding Lon protease (Mfl404), which could be linked to the heat-shock response.
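The reassignment of gene-reaction rules for the imported dipeptide reactions can be sketched as follows. This is a minimal illustration under stated assumptions: the reaction identifiers and the original gene names are hypothetical placeholders, and joining the five genes with `and` (a single transporter complex) is an assumed convention rather than the rule documented for iJL208.

```python
def complex_rule(genes):
    """Build a rule in which all genes are required (one protein complex)."""
    return " and ".join(genes)


def reassign_rules(rules, reaction_ids, genes):
    """Return a copy of rules with the listed reactions mapped to genes."""
    new_rule = complex_rule(genes)
    updated = dict(rules)  # leave the input mapping untouched
    for rid in reaction_ids:
        if rid not in updated:
            raise KeyError(f"unknown reaction: {rid}")
        updated[rid] = new_rule
    return updated


# Mfl094 to Mfl098, as stated in the text.
OPP_GENES = [f"Mfl{n:03d}" for n in range(94, 99)]

# Hypothetical reaction IDs and source-model genes, for illustration only.
imported = {
    "DIPEP_IMPORT_1": "SRC_GENE_A and SRC_GENE_B",
    "DIPEP_IMPORT_2": "SRC_GENE_A and SRC_GENE_B",
    "DIPEP_CLEAVAGE_1": "SRC_GENE_C",
}

curated = reassign_rules(imported, ["DIPEP_IMPORT_1", "DIPEP_IMPORT_2"], OPP_GENES)
print(curated["DIPEP_IMPORT_1"])
```

Only the import reactions are touched here; the cleavage reactions keep their rule until proteases are assigned to them separately.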
The free amino acid import was hypothesized to be mediated via either Mfl605 or the complex formed by Mfl183 and Mfl184. Since Mfl183 and Mfl184 are annotated as hypothetical proteins in PATRIC, no further constraint was added through the addition of an ABC transport system requiring ATP. Instead, all free amino acid import reactions were considered to be proton symport.

Energy
The Energy module contains reactions associated with the production of ATP, alternate carbon metabolism reactions, oxidoreduction balance, pyruvate metabolism and an ATP pump. As in most Mollicutes (Miles, 1992), the tricarboxylic acid (TCA) cycle is absent from M. florum and glycolysis is the only ATP generating pathway, with lactate and acetate being the two possible fermentation by-products.

Glycolysis
Although carbon sources utilized in glycolysis are predicted to be phosphorylated through PTS-associated transport, M. florum's genome encodes two sugar kinases. Glucose kinase (Mfl497) probably phosphorylates the remaining phosphate-free glucose molecule after trehalose is cleaved by trehalose-6-phosphate hydrolase. On the other hand, fructose kinase (Mfl514) most likely phosphorylates fructose after the sucrose molecule is cleaved by sucrose-6-phosphate hydrolase (Mfl515 or Mfl526). Aside from this initial phosphorylation step, M. florum's glycolysis differs from that of some previously modelled Mollicutes at the glyceraldehyde-3-phosphate dehydrogenase step. In M. florum, this enzyme has two versions. Mfl578 is annotated as the standard NAD-dependent dehydrogenase converting glyceraldehyde-3-phosphate (g3p) into 3-phospho-glyceroyl phosphate (13dpg), a reaction that reduces NAD+ to NADH. The alternative reaction is catalyzed by Mfl259 (both PATRIC and GenBank annotations agree on NADP specificity) and converts g3p into 3-phospho-glycerate (3pg), bypassing the phosphoglycerate kinase (Mfl577) reaction. This reaction utilizes NADP and generates NADPH.

ATPase pump
As for other Mollicutes, M. florum possesses an ATPase pump. Contrary to previous observations that the ATPase of Mollicutes is composed of seven genes (Béven et al, 2012), in M. florum, a cluster of eight genes (Mfl109 to Mfl116, inclusively) is proposed to form this complex. Unlike other bacteria where the F1F0 ATPase is used to generate energy from a proton gradient, in Mollicutes the ATPase is believed to be used by the cell to maintain an electro-chemical gradient at the cost of ATP. As previously reported the ATPase pump is also essential in M. florum with all eight genes identified as essential and six of the eight genes untouched by any transposon (Dataset EV6). The first and the last gene in the genomic sequence were hit by a transposon only in the terminal part of the gene (the last 20%) which could still allow for the complex to form.

Secretion products
In M. florum, the enzyme lactate dehydrogenase (LDH: Mfl596) and the pyruvate dehydrogenase complex (PDH: Mfl039, Mfl040, Mfl041, and Mfl042) are annotated and would allow two outcomes for pyruvate. The first path through lactate leads to the production of NAD+ and lactate. NAD+ is used in glycolysis again whereas lactate needs to disappear from the system. No transporter was annotated for lactate, hence the orphan reaction L-LACt and the sink SK_L_LAC were added to the network creating an escape route for lactate. The PDH path leads to the formation of acetate for which no transporter was annotated either. Again, two orphan reactions were added to eliminate acetate from the system, a transport (ACtr) and a sink (SK_AC).
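The effect of adding an orphan transport and a sink can be illustrated on a toy network. The sketch below assumes BiGG-style metabolite identifiers, a lumped glycolysis reaction that ignores ATP, and a sink acting on extracellular lactate; none of these details are confirmed by the model files, and only the reaction names L-LACt and SK_L_LAC come from the text.

```python
def dead_end_metabolites(reactions):
    """List metabolites that are produced by some reaction but never consumed."""
    produced, consumed = set(), set()
    for stoich in reactions.values():
        for met, coeff in stoich.items():
            if coeff > 0:
                produced.add(met)
            elif coeff < 0:
                consumed.add(met)
    return sorted(produced - consumed)


# Toy core: lumped glycolysis (ATP omitted) followed by lactate dehydrogenase.
core = {
    "GLYC_LUMPED": {"glc__D_c": -1, "nad_c": -2, "pyr_c": 2, "nadh_c": 2, "h_c": 2},
    "LDH_L": {"pyr_c": -1, "nadh_c": -1, "h_c": -1, "lac__L_c": 1, "nad_c": 1},
}
print(dead_end_metabolites(core))

# Orphan transport and sink give lactate an exit route.
patched = dict(core)
patched["L-LACt"] = {"lac__L_c": -1, "lac__L_e": 1}
patched["SK_L_LAC"] = {"lac__L_e": -1}  # assumption: sink acts on the extracellular pool
print(dead_end_metabolites(patched))
```

Without the two added reactions, lactate accumulates and no steady-state flux can pass through LDH; the same reasoning applies to acetate with ACtr and SK_AC.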

Lipids
The Lipids module contains the necessary machinery to synthesize the single M. florum cell membrane. Whole fatty acids are imported through two lipid transport proteins (Mfl590 and Mfl591). These fatty acids are then fixed to a glycerol backbone in a process dependent on the acyl-carrier protein (ACP, Mfl593). In the model, this generic glycerolipid is used to form the different lipid species previously detected in M. florum (Matteau et al, 2020).

Identification of lipid synthesis genes
The lipid synthesis network in M. florum was reconstructed using the available annotations (Datasets EV3 and EV4) and previously published experimental lipidomic data (Matteau et al, 2020). Most Mollicutes do not possess the ability to generate long-chain fatty acids, an energy-intensive process (Pollack et al, 1997). Lipid metabolism and requirements in Mollicutes are hard to assess (Yus et al, 2009). Despite their genetic simplicity, Mollicutes have conserved a rather high level of lipid complexity (Pollack et al, 1997). Although some studies have shown that A. laidlawii can execute de novo synthesis of fatty acids, the majority of less complex Mollicutes cannot execute this task since they appear to lack the necessary machinery, and also because the metabolic cost of fatty acid elongation (32 mol ATP per fatty acid) could be too high for these scavengers (Heath et al, 2002).

Experimental identification of lipid species
The previous characterization of the M. florum membrane composition by lipidomics (Matteau et al, 2020) was used as a guide for the identification of potential end goals of metabolic pathways. Nevertheless, these results were generated in the rich ATCC 1161 medium, which contains undefined lipid species. The possibility that these lipids are residual from the undefined growth medium cannot be ruled out, even considering the efforts that were made to wash the cells adequately before the experiment and the algorithmic noise reduction applied to these results. The lipidomic results were therefore evaluated when reconstructing the metabolic network, and the reactions necessary for the production of these lipid species were added to the model (adding orphan or promiscuous reactions when necessary).

General mechanism
The mechanism for the production of the lipid classes reported by Matteau and colleagues (Matteau et al, 2020) was assumed to be dependent on the Acyl-Carrier Protein (ACP, Mfl593). This highly conserved protein (Byers & Gong, 2007) can fix free fatty acids (FFA). FFA transport is potentially executed by Mfl590 and Mfl591, both annotated as "fatty acid binding/lipid transport protein" (GenBank) or a DegV family protein (PATRIC) in Pfam. The decision was made to use a single FFA (octadecanoate, n-C18:0) to serve as the fatty acid chain for all lipid classes in the model. The elongation of fatty acids is generally absent in Mollicutes (Pollack et al, 1997) and no gene was identified that could catalyze this process. Since no elongation was modelled, the length of the fatty acid does not add a constraint on the system. If M. florum is presented with many different FFA in complex growth media, these FFA may be imported into the cell and then incorporated into the cytoplasmic membrane. Upon activation of the ACP (Mfl384), the putative mechanism would involve the fixation of the FFA by an acyltransferase (Mfl607), yielding an FFA-bound ACP.
Fixation of the fatty acid chain to the glycerol backbone requires the production of a phosphorylated FFA (Mfl230) that can be fixed to the glycerol backbone (Mfl337). This FFA bound glycerol is converted into phosphatidic acid upon fixation of another fatty acid (Mfl382).

Phosphatidic acid derivatives
Previous

Di-acylglycerol (DAG)
Diacylglycerol can be formed from PA. The current annotation does not contain any phosphatidate phosphatase that would be required for the synthesis of this metabolite. An orphan reaction (PAPA180) was added to satisfy this need.

Cardiolipin and phosphatidylglycerol
Cardiolipin is a component of the cell membrane in all three domains of life (Schlame, 2008) and a ubiquitous component of the core biomass of prokaryotes as revealed by Xavier and colleagues (Xavier et al, 2017). While cardiolipin was not specifically identified in previous M. florum lipidomics experiments, Mfl626 is annotated as a cardiolipin synthase in both RefSeq and PATRIC. Phosphatidylglycerol (PG) was detected in lipidomics data and may be produced by M. florum. The presence of PG could be associated with cardiolipin due to the structure of the latter (cardiolipin is also known as diphosphatidylglycerol). The entire pathway for the synthesis of cardiolipin is annotated in M. florum, so the reactions were added to the model.

Sphingomyelin
Sphingomyelin was ranked first in relative lipid abundance in previously generated M. florum lipidomic data (Matteau et al, 2020). Nevertheless, bacteria do not possess the capacity to produce sphingomyelin, an essential component of nerve tissue in mammalian cells (Oshida et al, 2003). Therefore, if this compound is present in the M. florum membrane it would be the result of a direct salvage from the environment. It has been reported that Mollicutes possess lipid salvage capability (Salman & Rottem, 1995;Saito et al, 1978). Sphingomyelin has also been shown to favor growth in a defined medium for some Spiroplasma species (Hackett et al, 1987).
Despite these observations, no gene could be attributed to the import of sphingomyelin by M. florum. We hypothesized that the favored growth in the presence of sphingomyelin was due to increased lipid solubility, which would facilitate the import of FFA from the medium. Given its high abundance in the published lipidomic dataset, sphingomyelin was added to the model and to the biomass objective function (BOF). Characterizing the cell membrane again, in a completely defined medium, would help determine the role and importance of sphingomyelin in M. florum.

Glycans
A similar data-driven approach was used for the reconstruction of the Glycans module, which contains the reactions responsible for the synthesis of the extracellular polysaccharide layer previously described for M. florum (Matteau et al, 2020). For modelling purposes, the synthesis of the capsular polysaccharides (CPS) was assumed to include the conversion of sugars in the following steps:
1. The sugars are imported and phosphorylated in the process, usually via a PTS.
2. The phosphate group is transferred onto the first carbon, thereby labelling the sugar for polysaccharide synthesis.
3. The individual sugars are fixed to a triphosphate nucleotide via a nucleotidyltransferase.
4. The sugars are polymerized into a chain by a glycosyltransferase, using the energy contained in the phosphate bond with the nucleotide diphosphate.
5. The polymerized glycan is flipped to the extracellular side of the membrane by a flippase.
We suggested that a glucose transporter, either Mfl217 or Mfl187, could be promiscuous and allow the entry of sugar molecules composing the CPS. Interestingly, the M. florum GC-MS analysis also revealed the presence of rhamnose. While this sugar is similar to the other two, it lacks a hydroxyl group which is necessary for its phosphorylation upon entry. Therefore, the import of rhamnose was not associated with a gene and is included in the functions in search of a gene (Dataset EV4).
Sugars imported through the PTS should be phosphorylated on carbon 6. In order to be included in a polysaccharide, the phosphate group should be transferred on the first carbon. Mfl120 is annotated in RefSeq as a phosphomannomutase while being a D-Ribose 1,5-phosphomutase in PATRIC. Also, our re-annotation process allowed identifying three different EC numbers for this protein (Datasets EV3 and EV4). Together, these observations suggest that this enzyme is promiscuous (see Figure 6B). The conversion of hexose-6-phosphate to hexose-1-phosphate was therefore assigned to this gene for all sugars. Hexose-1-phosphate sugars are fixed to a nucleotide-triphosphate via a nucleotidyl transferase/hydrolase. We suggest that Mfl245 occupies that function for all sugars. Aside

Vitamins & Cofactors
As for lipids and glycans, the synthesis of vitamins and cofactors in M. florum is minimal. We describe here the pathways leading to the import and utilization of coenzymes that were identified in the annotation and used in the reconstruction process (Datasets EV3 and EV4).

Nicotinamide adenine dinucleotide
Both phosphorylated and unphosphorylated forms of nicotinamide adenine dinucleotide (NADP and NAD) are found in reactions of the metabolic network. Additionally, this coenzyme has a detailed pathway for incorporation in M. florum. While no transporter is specifically annotated for its import, NAD is a combination of two nucleotides joined by their phosphate groups, and it is possible that this configuration allows it to be imported through the same transporter as nucleobases (discussed above).

Coenzyme A
In the metabolic network, Coenzyme A (CoA) is used in the biosynthesis of lipids to activate the apo-Acyl-carrier protein and in the PDH complex. Metabolically speaking, this coenzyme is therefore essential. Nevertheless, its import and synthesis remain to be characterized in M. florum. Indeed, no transport reaction could be found that imports CoA specifically and a single enzyme of its synthesis pathway, dephospho-CoA kinase (Mfl281), is annotated.

Lipoate
Lipoate is used in the PDH complex, for which a lipoyl-adenylate protein ligase (Mfl038) is annotated. Neither an import mechanism nor a potential synthesis pathway is annotated for this coenzyme.

Thiamine
A thiamine diphosphokinase (Mfl224) is annotated in PATRIC. A consistent EC number (EC 2.7.6.2) was attributed to this gene by both PATRIC and COFACTOR, which is interesting since RefSeq identified it as a "hypothetical protein". The presence of thiamine in the network is therefore supported by this annotation. The other is a putrescine/ornithine APC transporter (Mfl664).

Minerals
The import of minerals in the metabolic network was first considered based on the known annotation. The manual curation of the genome identified inorganic phosphate, magnesium, cobalt, zinc, potassium, and sodium as potentially imported ions. Some key minerals were not gene-associated but nevertheless included in the model since they represent universally essential cofactors in prokaryotes (Xavier et al, 2017).
The import of inorganic phosphate is also annotated in JCVI-syn3A (Breuer et al, 2019) and was associated with three genes in M. florum (Mfl233, Mfl234 and Mfl235). Two EC numbers (EC 3.6.3.27 or 3.6.3.33) could be identified for one of these genes (Mfl235), the ATP-binding protein of the complex. Interestingly, this three-gene cluster has a transcriptional regulator right next to it, suggesting an operon-type regulation and an important feature of M. florum's metabolism.
Magnesium is essential for the polymerization of nucleic acids and a specific ATPase transporter is present to ensure its import (Mfl496). Other genes are also linked to its transport through the cell membrane (Mfl217 and Mfl356). These transporters nevertheless seem to serve a more general purpose of large cation import/export, as revealed by their annotation (Mfl217, Mg/Co/Ni Finally, potassium and sodium may be imported through a three-gene complex (Mfl164, Mfl165 and Mfl166) that are annotated as a K+, Na+ uptake protein integral membrane subunit, the trkA gene, and the trkH gene.

M. florum growth medium
The most common culture medium of M. florum is ATCC 1161, a complex mixture of heart infusion broth, horse serum (HS), and YE. To test M. florum metabolic capabilities, we sought to replace these undefined components with a completely defined cell culture medium. We found that supplementing the commercial CMRL-1066 medium base with 0.313% HS and 0.02% YE, referred to as CSY, allowed significant growth only when a sugar source was also provided (Appendix Figure S3). Converting all medium components to BiGG identifiers (Norsigian et al, 2020) made it possible to compare the composition of CMRL-1066 to the metabolites in the reconstruction.
The metabolic reconstruction provided 84 transport reactions and extracellular metabolites. To simulate growth on CSY, the in silico minimal medium was defined using the COBRApy toolbox (Ebrahim et al, 2013) (Appendix Table S2). Of the 55 components included in CMRL-1066, 36 were present in the original extracellular metabolites and 19 were missing. The missing components were evaluated individually:
-Trans-4-hydroxy-proline: hydroxyproline is a component of collagen. This metabolite is present in Saccharomyces cerevisiae where a hydroxyproline reductase is present (King et al, 2016). This reaction is absent from M. florum. Hydroxyproline is not likely to be necessary for M. florum growth and was therefore not added to the model. This EC number is absent in M. florum L1 (see Datasets EV3 and EV4). It is not impossible that 4-aminobenzoate is used to produce folate in M. florum, but not enough evidence is present to add the compound to the medium.
-Biotin: Biotin has been reported to be an essential cofactor in bacteria but is only found in pathways absent in M. florum (Salaemae et al, 2016). These pathways include fatty acid biosynthesis, replenishment of the TCA cycle and amino acid metabolism. Since M. florum does not contain a TCA cycle, nor any elaborate fatty acid or amino acid biosynthesis pathways, this vitamin is unlikely to be used.
-Thiamin diphosphate: a complete pathway with a specific transporter is annotated for thiamin in M. florum. This coenzyme is used by the PDH complex, an enzyme essential for the production of acetate in the presence of oxygen. Thiamin diphosphate is likely generated from intracellular thiamin imported from the growth medium.
-Sulfate: cysteine desulfurase seems to be taking the role of providing the cell with sulfur, an essential metabolite.
-Glutathione: as a tripeptide, this medium component could be imported by the peptide importer system. Given its role as an antioxidant, its import could reduce the susceptibility to oxidative stress in M. florum when added to the medium. This hypothesis could be tested by growth assays under oxidative stress with or without glutathione.
-Glucuronate: this monosaccharide is involved in proteoglycan synthesis in many species. It is likely that its import could be done by one of the sugar importers and it could potentially contribute to the synthesis of the M. florum glycans.
-Cholesterol: helps to solubilize FFA and facilitate their import.
-Tween 80: also helps to solubilize FFA and facilitate their import.
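The comparison between medium components and the extracellular metabolites of the reconstruction amounts to a set intersection and difference once both lists share the same identifier namespace. The sketch below illustrates this step with a handful of hypothetical BiGG-style identifiers; the actual lists are those of Appendix Table S2, and the function name `split_medium` is an assumption made for illustration.

```python
# Hypothetical identifiers for illustration only; they do not reproduce
# the 55 CMRL-1066 components or the extracellular metabolites of the model.
medium_components = {"glc__D", "gthrd", "btn", "chsterol", "so4", "thm"}
extracellular_metabolites = {"glc__D", "thm", "sucr", "pi"}


def split_medium(components, model_metabolites):
    """Return (present, missing) medium components relative to the model."""
    present = components & model_metabolites   # already in the reconstruction
    missing = components - model_metabolites   # need individual evaluation
    return sorted(present), sorted(missing)


present, missing = split_medium(medium_components, extracellular_metabolites)
print("present:", present)
print("missing:", missing)
```

Each identifier in `missing` would then be evaluated individually, as done for the components listed above.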

Biomass objective function
The biomass objective function (BOF) represents the sum of all metabolic goals of an organism in a given environment. In order to be representative of the cellular state, the biomass function should be derived from experimental measurements. Previous work yielded the detailed composition of M. florum biomass (Matteau et al, 2020). Together with these data, the BOFdat software (Lachance et al, 2019) was used to determine the biomass precursors to include in the BOF and their respective stoichiometric coefficients (see Figure 5A). Genomic (DNA), transcriptomic (RNA) and proteomic (proteins) data along with macromolecular weight fractions (MWF) for each category were used as input to determine stoichiometric coefficients using Step1 of BOFdat (Appendix Table S3).
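The principle of converting a macromolecular weight fraction into stoichiometric coefficients can be sketched as follows. This is a simplified illustration and not the BOFdat implementation: it assumes that the monomer molar fractions and residue molar masses are known, and all numerical values (weight fraction, base composition, approximate residue masses) are hypothetical placeholders rather than the measured M. florum values.

```python
def macromolecule_coefficients(weight_fraction, molar_fractions, molar_masses):
    """Convert a weight fraction (g per gDW) into coefficients (mmol per gDW).

    molar_fractions: monomer -> molar fraction within the macromolecule
    molar_masses:    monomer -> residue molar mass in g/mol
    """
    # Average residue mass of the polymer, weighted by molar composition.
    average_mass = sum(molar_fractions[m] * molar_masses[m] for m in molar_fractions)
    # Total amount of monomer per gram of dry weight, in mmol.
    total_mmol = 1000.0 * weight_fraction / average_mass
    return {m: molar_fractions[m] * total_mmol for m in molar_fractions}


# Hypothetical inputs: a DNA weight fraction of 5% and an AT-rich composition.
dna_fraction = 0.05
base_composition = {"dAMP": 0.365, "dTMP": 0.365, "dGMP": 0.135, "dCMP": 0.135}
residue_masses = {"dAMP": 313.2, "dTMP": 304.2, "dGMP": 329.2, "dCMP": 289.2}

coefficients = macromolecule_coefficients(dna_fraction, base_composition, residue_masses)

# Sanity check: the coefficients should add back up to the weight fraction.
recovered_mass = sum(coefficients[m] * residue_masses[m] for m in coefficients) / 1000.0
print(coefficients)
print(recovered_mass)
```

The final check confirms that the mass represented by the coefficients equals the input weight fraction, which is the property that keeps the biomass reaction scaled to one gram of dry weight.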
The Step2 of BOFdat identified 16 coenzymes and cofactors to be added to the biomass. Ions that are commonly found in bacteria are also identified in the reconstruction. 12 ions were identified in this step (calcium, manganese, cobalt, molybdate, chloride, sodium, ammonium, zinc, potassium, nickel, magnesium and hydrogen). The only coenzyme identified was nicotinamide and its derivatives: oxidized and reduced versions of the phosphorylated and non-phosphorylated forms (Appendix Table S3 and Figure 5A).
Lipids and glycans were not added to the equation by the first and second steps. The decision was made to forgo their addition in these steps since the experimental data required curation. Their inclusion, along with other metabolites, was left to the unbiased genetic algorithm performed in BOFdat Step3 (Figure 5A, Table 2, and Appendix Figure S7). Accordingly, two lipids were identified (phosphatidylcholine, phosphatidylglycerophosphate). The identification of the phosphorylated version of choline further supported the presence of phosphatidylcholine in the membrane. The acyl-carrier protein was also added given its importance for the synthesis of lipids and its ubiquitous presence in prokaryotes. The capsular polysaccharide metabolite formulated during the reconstruction was also added during this step (Appendix Table S3).
Interestingly, S-adenosyl methionine was identified in BOFdat Step3. This metabolite is a common co-substrate involved in the transfer of methyl groups and is highly important in many organisms. Consistent with this identification, methyltetrahydrofolate and sulfur were also identified and added during Step3.
The polyamines spermidine and putrescine were added during BOFdat Step3. The exact function of these metabolites is not precisely known in prokaryotes, but they are found widely across species. Putrescine was nevertheless not found in CMRL-1066. Depriving M. florum of either of these polyamines in a completely defined medium could shed light on their function in prokaryotes.
Finally, cytidine and adenosine were identified by BOFdat Step3, which can probably be attributed to the essentiality of the genes that make these specific metabolites. This likely means that some routes that were proposed in the nucleotide salvage pathway are not actually possible in vitro.

Sensitivity analysis
The main carbohydrate provided in ATCC 1161 medium is sucrose. Hence, when grown in CSY medium, M. florum was also provided with sucrose and its specific uptake rate was measured with high-performance liquid chromatography (HPLC) (Figure 4E, Appendix Figures S4, S5AB and S6AB). The obtained value was -5.26 mmol per gram of M. florum dry weight per hour (mmol•gDW⁻¹•hr⁻¹), which is similar to previously published results for glucose uptake rate in M. pneumoniae (7.37 mmol•gDW⁻¹•hr⁻¹) and M. gallisepticum (16.53 mmol•gDW⁻¹•hr⁻¹) (Wodke et al, 2013;Bautista et al, 2013) (Table 3). Nevertheless, to our knowledge, the sucrose uptake rate calculated here is a first amongst Mollicutes. This value is slightly lower than values previously observed for E. coli, which ranged between 7.01 and 14.10 mmol•gDW⁻¹•hr⁻¹ following adaptive laboratory evolution (Mohamed et al, 2019).
The secretion rates were obtained for both lactate and acetate, the two possible fermentation products in M. florum (Figures 2 and 4F, Appendix Figure S5C and S6C  , pta), and an acetate kinase (Mfl044, ackA). Contrary to lactate production, the path to acetate releases CO2 as a metabolic waste, yields one ATP but reduces one NAD molecule into NADH. While the production of one molecule of ATP seems profitable for the cell, the one NADH molecule must be re-oxidized to make this process sustainable. One key reaction involved in this process is the NADH oxidase (Mfl037, nox). This enzyme uses molecular oxygen (O2) to convert NADH back to NAD. This reaction exists in two forms: H2O producing or H2O2 producing. While producing water molecules is not harmful for the cell, hydrogen peroxide is a toxic waste that needs to be eliminated. While this task could be achieved by the L-methionine S oxide reductase (Mfl050, msrA), the specificity to H2O2 is not confirmed. The final model therefore uses the NOX2 reaction (BiGG identifier), which produces water instead of hydrogen peroxide. Detecting the production of hydrogen peroxide by M. florum could shed light on this process.
The acetate production can be probed using iJL208. In the final version of the model, lactate secretion was favored since the expression of the LDH was much higher than that of the PDH complex (Matteau et al, 2020). To favor the production of lactate, restrictive bounds were applied to key reactions. The upper bound of the NOX reaction was fixed at 5 mmol•gDW⁻¹•hr⁻¹. This limits the amount of oxygen that can be used to oxidize NADH back to NAD, which is then available for glycolysis. Since M. florum is a facultative aerobe, the logical decision was to limit the impact of oxygen on its growth phenotype. Limiting the capacity of the NADH oxidase was one option; the other was to restrict oxygen import. This could be done by tightening the lower bound of its exchange reaction (EX_o2_e), that is, by making the allowed uptake flux less negative.
To probe the production range of acetate, the bounds can be changed on these critical reactions.
Setting the same lower and upper bounds, here 0 and 10, on the secretion of both lactate and acetate releases the experimental constraints. With these bounds applied, increasing the limit on the oxygen uptake rate and the upper bound on the NADH oxidase reaction eventually results in a favorable utilization of the acetate secretion pathway. In this small case study, this was observed when the bounds were at 25, which is >1.5 times the upper bound on the secretion rates or approximately >5 times the sucrose uptake rate.
These model predictions stating that a very high amount of oxygen is required to efficiently produce acetate are consistent with the expression levels of ldh and pdh genes which suggest a higher production of lactate. The settings implemented in the final version of iJL208 ensured that the ATP synthase pump was essential as observed experimentally, which also supported the final choice of constraints.
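The redox reasoning behind this sensitivity analysis can be illustrated with a toy balance that is independent of iJL208. The sketch assumes, for illustration only, that each pyruvate is formed together with one glycolytic NADH, that lactate formation re-oxidizes one NADH, that the acetate route yields one additional NADH and one ATP (as described above), that all remaining NADH is re-oxidized by the NADH oxidase, and that the cell maximizes ATP. Under these assumptions each acetate requires two NADH to be oxidized by NOX. The function name, the choice to express the NOX capacity in NADH units, and the flux values are hypothetical.

```python
def fermentation_split(pyruvate_flux, nox_capacity, acetate_ub=float("inf")):
    """Toy split of pyruvate between acetate and lactate under a redox limit.

    NADH balance: pyruvate_flux + acetate = lactate + nox, with
    pyruvate_flux = lactate + acetate, hence nox = 2 * acetate.
    nox_capacity is expressed as NADH oxidized per unit time (assumption).
    """
    acetate = min(pyruvate_flux, nox_capacity / 2.0, acetate_ub)
    lactate = pyruvate_flux - acetate
    return {"acetate": acetate, "lactate": lactate, "extra_atp": acetate}


# Scan a few hypothetical NOX capacities at a fixed pyruvate supply.
for capacity in (0.0, 5.0, 25.0):
    print(capacity, fermentation_split(pyruvate_flux=10.0, nox_capacity=capacity))
```

In this toy setting, a low NOX capacity forces most pyruvate toward lactate, while acetate only dominates once the oxidase capacity becomes large relative to the pyruvate supply. The actual analysis relies on changing reaction bounds in the genome-scale model rather than on this simplified balance.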

Carbohydrates utilization
Reducing the concentration of rich undefined components in the medium revealed a clear difference between sucrose-supplemented medium and a no-sugar control (Figure 4A and Appendix Figure S3), which in turn made it possible to test the assimilation of 14 different carbohydrates by M. florum. Upon comparison with model predictions, eight no-growth and four growth phenotypes were correctly predicted (Figure 6A and Figure EV1). Two additional sugars, mannose and maltose, were found to be utilized by M. florum but had not been predicted by the model.
To recover those phenotypes, the alternate carbon metabolism of M. florum was studied, seeking enzymes that would likely carry a promiscuous activity. The three-dimensional structures reconstructed with I-TASSER were leveraged for that task (Figure 1C and Dataset EV2). While the specificity of transporters could not be addressed with this method, downstream enzymes allowing the catabolism of mannose and maltose could be compared with the PDB using the FATCAT 2.0 server (Li et al, 2020). Specifically, the annotation of three enzymes (Mfl120, Mfl254 and Mfl499) involved in the assimilation of glucose and trehalose was considered. Using the same approach, the specificity of two aldolases (Mfl121 and Mfl639) was assessed to recover the expression phenotype of enzymes of the PPP (Figure EV2 and Appendix Table S4).
The structural similarity between maltose and trehalose suggested they could use the same route into glycolysis. While the promiscuity of the transporter used to import maltose could not be tested in silico, it was hypothesized that the trehalose hydrolase (Mfl499) could also hydrolyze maltose.
To generate a 3D structure for Mfl499, I-TASSER used the Bacillus sp. α-glucosidase BspAG13_31A (PDB: 5zcc) as a template given the similarity of both sequences. This template structure was shown to have a high specificity for the α-(1-4)-glucosidic linkage (Auiewiriyanukul et al, 2018). The reconstructed structure was compared to the template used by I-TASSER with FATCAT 2.0 (p = 0.00, Figures 6B and EV2). The sequence and structural similarity with an enzyme capable of acting on both maltose and trehalose supports the hypothesis that Mfl499 is involved in maltose assimilation. The addition of both the promiscuous transport and cleavage reactions was sufficient to provide a growth prediction on maltose.
The capability of M. florum to metabolize mannose could be explained if the glucose-6-phosphate (G6P) isomerase, PGI, (Mfl254) was able to convert mannose-6-phosphate (M6P) into fructose-6-phosphate (F6P), thereby entering glycolysis. The reconstructed structure of the M. florum PGI was compared to that of Pyrobaculum aerophilum (Swan et al, 2004) (PDB:1TZB), known for its capability of converting either G6P or M6P into F6P. The structural similarity between these enzymes (p = 8.68e-12, Figure EV2) was consistent with this hypothesis and the model was modified accordingly.
The utilization of mannose was also studied in the context of glycan synthesis. Gas chromatography previously revealed the presence of both glucose and mannose in the CPS of M. florum (Matteau et al, 2020). The presence of a phosphomannomutase, PMM, (Mfl120) in the annotation suggested the conversion of M6P into mannose-1-phosphate (M1P), a necessary precursor for glycan synthesis (Bertin et al, 2015). The template used by I-TASSER for the reconstruction of the 3D structure of Mfl120 was the PMM/PGM structure from Pseudomonas aeruginosa (PDB:1K35). In this organism, the enzyme is necessary for the production of exopolysaccharides (Regni et al, 2002) with G6P and M6P entering the glycan synthesis pathway through the same enzyme. Given the structural similarity of the M. florum and P. aeruginosa enzymes (Figure EV2), the promiscuous mutase reaction catalyzed by Mfl120 was added to the model and was sufficient to formulate a positive growth prediction for mannose.

Gene expression
Flux balance analysis enables the prediction of other phenotypes such as flux states and gene essentiality. These predictions can be used along with experimental data to improve the model quality (Thiele & Palsson, 2010). Genome-wide expression (Matteau et al, 2020) and transposon insertion (Tn-Seq) (Baby et al, 2018) datasets available for M. florum were used as a reference for the validation of model predictions. Gene expression was compared to the model predicted flux states by converting both datasets to binary "on" or "off" values. The set of expressed genes was defined by finding the thresholds that would provide the best match between transcriptomic and proteomic data while maximizing the number of expressed genes (Appendix Figure S8 and Dataset EV5). At the selected thresholds, 531 genes had a consistent signal in both proteomic and transcriptomic data while 145 had mixed signals (e.g., proteomic "on" and transcriptomic "off"). Only the genes for which datasets were consistent with each other were used for comparison with the model. This set contains 423 expressed and 108 silent genes (Appendix The flux state through the metabolic network was obtained by optimizing the production of biomass using parsimonious flux balance analysis (pFBA), a version of FBA that allows the generation of a unique flux state prediction through minimization of enzyme usage (Lewis et al, 2010). This method is best suited for the comparison of predicted fluxes to gene expression (Machado & Herrgård, 2014). A reaction flux was defined as active when the predicted value exceeded the numerical error (1e-8), and the flux was attributed to every gene that could catalyze the reaction via the gene-reaction rule. The comparison of binary flux predictions and observed expression was performed on the subset of model genes (173/208) for which proteomic and transcriptomic data showed no discrepancy.
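The binary comparison between predicted fluxes and observed expression can be sketched as follows. This is a minimal illustration with hypothetical reaction and gene identifiers and made-up flux values, not the analysis code used for iJL208. It applies the 1e-8 tolerance mentioned above to the absolute flux value (taking the absolute value is an assumption made here to handle reversible reactions) and attributes an active flux to every gene associated with the reaction.

```python
FLUX_TOLERANCE = 1e-8  # numerical error threshold stated in the text


def active_genes(fluxes, reaction_genes, tol=FLUX_TOLERANCE):
    """Return the genes linked to at least one reaction carrying flux."""
    active = set()
    for reaction, genes in reaction_genes.items():
        if abs(fluxes.get(reaction, 0.0)) > tol:
            active.update(genes)
    return active


def expression_accuracy(predicted_on, observed):
    """Fraction of genes whose predicted on/off state matches the data.

    observed maps gene -> bool and should only contain genes for which
    proteomic and transcriptomic signals agree.
    """
    matches = sum((gene in predicted_on) == is_on for gene, is_on in observed.items())
    return matches / len(observed)


# Hypothetical flux state and gene-reaction associations.
fluxes = {"R1": 3.2, "R2": -9.0, "R3": 0.0, "R4": 1e-12}
reaction_genes = {"R1": ["g1"], "R2": ["g2", "g3"], "R3": ["g4"], "R4": ["g5"]}
observed = {"g1": True, "g2": True, "g3": False, "g4": True, "g5": False}

predicted_on = active_genes(fluxes, reaction_genes)
accuracy = expression_accuracy(predicted_on, observed)
print(sorted(predicted_on), accuracy)
```

Genes such as `g3` (predicted on, observed off) and `g4` (predicted off, observed on) are the kind of discrepancies that were then examined during curation.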

Gene essentiality
Single gene essentiality was reported previously (Baby et al, 2018), with ~290 M. florum genes proposed to be essential, which is considerably lower than the 382/482 essential genes reported by Glass and colleagues in M. genitalium (Glass et al, 2006).

Comparing with model predictions
Comparing the model predictions with experimental data initially revealed erroneous predictions of the model that were manually addressed. True false negatives (TFN) were defined as genes simultaneously expressed and essential while neither flux nor essentiality was predicted. Eight TFN were identified. A single true false positive (TFP) was found, which had both flux and essentiality predictions but neither observed expression nor essentiality. Curating these genes increased the model accuracy in the prediction of gene expression from 74.25% to 78.03% and of essentiality from 74.52% to 76.92%. It is noteworthy that solving more than the TFP and TFN would mean fitting the model specifically to either the essentiality or the expression dataset.
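The TFN and TFP definitions given above combine four binary signals per gene, two observed and two predicted. A minimal sketch of that classification is shown below; the function name and the example genes are hypothetical, and genes that disagree on only one of the two criteria fall into a generic "other" class that was not targeted by curation.

```python
def classify_gene(expressed, essential, predicted_flux, predicted_essential):
    """Classify a gene as TFN, TFP, or other from four binary signals."""
    if expressed and essential and not predicted_flux and not predicted_essential:
        return "TFN"  # observed on and essential, model predicts neither
    if predicted_flux and predicted_essential and not expressed and not essential:
        return "TFP"  # model predicts both, neither is observed
    return "other"


# Hypothetical genes: (expressed, essential, predicted_flux, predicted_essential)
genes = {
    "gA": (True, True, False, False),
    "gB": (False, False, True, True),
    "gC": (True, True, True, True),
    "gD": (True, False, False, False),
}

labels = {gene: classify_gene(*signals) for gene, signals in genes.items()}
print(labels)
```

Restricting curation to genes flagged by both datasets, as in this classification, is what limits the risk of fitting the model to a single dataset.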
The accuracy of the iJL208 model is slightly lower than other Mollicutes' models (M. genitalium (Suthers et al, 2009), 79% initial and 87% after GrowMatch  should also be addressed as it could not be explained in the current metabolic network. Finally, the necessity of the CTP synthase (Mfl648) suggested that the exchange of amino groups within M. florum occurred primarily through exchanges in the glutamine/glutamate pool rather than through the import of ammonium from the medium but insufficient information was found in the literature to support that hypothesis.
Solving false negatives required the addition of specific constraint(s). The dUTPase (Mfl257) and dihydroxyacetone kinase (Mfl229) are simple examples of such cases (Appendix Figure S9A). In the first case, the accidental production of dUTP by the cell was mimicked by forcing a flux through the PYK10 reaction, which produces it from dUDP. In turn, this forces the activity of Mfl257 to produce dUMP and a pyrophosphate. The second case represents a similar cellular situation where the highly reactive molecule dihydroxyacetone phosphate spontaneously loses its phosphate, yielding dihydroxyacetone, a toxic molecule for the cell. A forced flux through this spontaneous reaction resolved the discrepancy, so that Mfl229 carries flux and is essential (Appendix Figure S9B).
A more complicated case is observed for the ribulose-5-phosphate epimerase (Mfl223). Activating it in the model requires forcing flux through the PPP. In many Mollicutes, this pathway is incomplete (Breuer et al, 2019;Wodke et al, 2013;Suthers et al, 2009;Miles, 1992), often because no gene can be attributed to the transaldolase reaction (TALA). The structures of two aldolases (Mfl121 and Mfl639) were compared against the PDB 90 using FATCAT 2.0 (Li et al, 2020). For both structures, a significant similarity was observed against the A chain of the transaldolase of Thermotoga maritima (1vpxA; TM0295, Mfl121 p-value: 5.96e-10; Mfl639 p-value: 2.51e-9).
Adding the TALA reaction enables flux through the PPP but does not force it since active uptake of ribose circumvents its need. The non-essentiality/non-expression of both ribose kinase (Mfl642, rbsK) and ribose ABC transporter (Mfl666, Mfl667, Mfl668, Mfl669) suggests that ribose was altogether absent from the ATCC 1161 growth medium in which the datasets were generated.
Ribose was therefore removed from the in silico medium, resulting in flux through the PPP and increased prediction accuracy for expression.

Curating True False Positive (TFP)
The path for synthesis of nicotinamide dinucleotide in M. florum was discussed above. The presence of a nicotinamidase (Mfl340) suggested the import of nicotinamide from the medium. Nevertheless, experimental data revealed that this enzyme is both non-essential and non-expressed, suggesting that the downstream metabolite, nicotinate, may be imported instead. Adding this metabolite to the in silico medium as well as an import reaction avoids the need for Mfl340, recapitulating experimental observations (Appendix Figure S9C).

Varying the growth rate results in different genome reduction scenarios
Together with experimental gene essentiality and the transcription unit architecture, iJL208 was used to formulate a minimal genome prediction. Using the MinGenome algorithm (Wang & Maranas, 2018), genome reduction scenarios were generated at different growth rates (Appendix Figure S10). This was made possible because MinGenome attempts to find the largest possible deletion in the genome without breaking the established constraints. The constraints imposed include the impossibility of deleting an essential gene or its associated promoter and the requirement that the GEM remain feasible. The model's objective and value are therefore fixed. If a gene deletion prevents the model from solving at this specific growth rate, then the deletion is not possible.
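The logic of searching for the largest deletion that keeps the model feasible can be illustrated with a greedy toy procedure. This is not the MinGenome formulation, which is an optimization problem defined by its authors; it is a simplified stand-in that assumes an ordered gene list, a set of protected genes (for example essential genes), and a user-supplied feasibility check standing in for the growth-rate constraint of the GEM. All names and data are hypothetical, and promoters and transcription units are ignored.

```python
def deletable_runs(gene_order, protected):
    """Yield (start, end) slices of maximal runs of unprotected genes."""
    start = None
    for index, gene in enumerate(gene_order):
        if gene in protected:
            if start is not None:
                yield (start, index)
                start = None
        elif start is None:
            start = index
    if start is not None:
        yield (start, len(gene_order))


def reduce_genome(gene_order, protected, is_feasible):
    """Greedily delete the longest contiguous run that keeps growth feasible."""
    genome = list(gene_order)
    deletions = []
    while True:
        runs = sorted(deletable_runs(genome, protected),
                      key=lambda run: run[1] - run[0], reverse=True)
        for start, end in runs:
            candidate = genome[:start] + genome[end:]
            if is_feasible(candidate):  # stand-in for solving the GEM
                deletions.append(genome[start:end])
                genome = candidate
                break
        else:
            # No remaining run can be deleted without breaking feasibility.
            return genome, deletions


# Hypothetical genome: g2 and g5 are redundant, one of them must remain.
gene_order = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
protected = {"g1", "g4", "g8"}
reduced, deleted = reduce_genome(
    gene_order, protected, lambda genome: "g2" in genome or "g5" in genome
)
print(reduced, deleted)
```

In this toy case the first deletion removes one copy of a redundant function, after which the run containing the other copy can no longer be deleted. The sketch only tests whole runs, so it can miss smaller feasible deletions inside a rejected run.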
Varying the minimum growth rate imposed as a constraint on the optimization problem formulated with MinGenome enables the deletion of genes that could hamper the growth rate without being completely lethal. While an array of growth rates was tested, only three different genome reduction scenarios were obtained. The similarity of the resulting genomes to JCVI-syn3.0 was assessed for each growth rate constraint imposed. The genomes formulated with a lower minimum growth rate constraint were more similar to JCVI-syn3.0 (Appendix Figure S10). The final genome size of the lower growth rate constraint scenarios was also smaller than that of the higher ones.
In all cases, no more genes could be deleted after 80 deletions. A minimal size was reached at 60% of the optimal growth rate, yielding a 562 kbp genome containing 563 genes and corresponding to a ~30% reduction from the initial M. florum L1 genome. In size, this minimal genome scenario lies between the JCVI-syn3.0-inspired genome (470 kbp, 409 genes) and the core genome of M. florum (644 kbp, 585 genes) suggested by Baby and colleagues (Baby et al, 2018).
While this model-driven prediction does not reduce the number of genes beyond that of JCVI-syn3.0, a reduction of 30% in genome size is a level that has been reached experimentally in different species. In fact, the B. subtilis genome was trimmed by 36% and yielded a functional genome (Reuß et al, 2017) with growth rates similar to the wild-type strain. To our knowledge, the smallest E. coli genome allowing robust growth is that of DGF-298 (Hirokawa et al, 2013). At 2.98 Mbp, this genome represents a 34.4% reduction compared to the original E. coli K-12 substr. W3110 (Westphal et al, 2016). B. subtilis and E. coli reduced-genome strains yielded robust cells with growth rates similar to their parental strain. This was not the case for JCVI-syn3.0, which was originally reported to have a lower growth rate and an altered morphology (Hutchison et al, 2016).
In order to produce a more robust and functional cell usable in the laboratory, 19 genes were added back into the original JCVI-syn3.0, generating a 493-gene bacterium called JCVI-syn3A (Breuer et al, 2019). Our current prediction of a minimal gene set (563 genes) is therefore 70 genes above JCVI-syn3A.

Functional analysis of the reduced genome
The functional categories of the deleted/retained proteins were analyzed using the KEGG ontology (Figure 7C and Figure EV4). Interestingly, the largest portion of loci considered for deletion belonged to the unmapped category (81). Reducing the number of unknown components is a key argument justifying research efforts in minimal cells. Identifying these non-essential hypothetical proteins was therefore crucial for further experimental efforts to reduce the genome of M. florum.
Next, we compared the number of genes in each KEGG functional category identified as deletion targets to the genes kept in the reduced genome scenario. Interestingly, the main category affected by deletions was uncharacterized proteins ("Not mapped") with 81 proteins (~56% of all deleted proteins), and a small fraction of those (16) had homologs in JCVI-syn3.0 (Figure 7CD and Figure EV4). With 191 proteins out of 535, the proportion of uncharacterized proteins retained in the M. florum reduced genome scenario (~36%) is also very similar to the reported proportion in JCVI-syn3.0 (149/438, 34%).
The second KEGG category with the highest number of deletions was "Metabolism", with 34 proteins removed ( Figure 7CD and Figure EV4). 155 proteins of this category remained in the reduced genome and 54 of those were not homologous to JCVI-syn3.0. Specifically, proteins deleted in M. florum but present in JCVI-syn3.0 were mostly found in the transport sub-category (Mfl019, Mfl234, Mfl533, Mfl534) and are annotated as ABC transporters. In accordance with our experimental data and iJL208, both the glutamine ABC transporter (Mfl019) and the phosphate ABC transporter (Mfl234) were not essential in M. florum since other routes exist for their import.
Mfl533 and Mfl534 were annotated as lipid A export proteins (msbA) but no evidence supports the presence of this metabolite in M. florum. Since the lipid module was amongst the least characterized and given its status in our prediction, these genes represent top priorities for further biochemical characterization. Comparing the deleted proteins with the remaining ones in iJL208 revealed that gene redundancy in the sucrose PTS importer allowed this function to be kept in the reduced genome. In contrast, the trehalose PTS was completely removed, suggesting that alternate versions of a minimal gene set for M. florum may have different auxotrophies.
Finally, the "Genetic information processing" (GIP) category was the least affected by deletions and contained the highest number of proteins in the reduced genome scenario ( Figure 7CD and Figure EV4). This category also contains the highest proportion of proteins shared with JCVI-syn3.0 (~89%).
We provide here a detailed analysis of the composition of the reduced genome by functional category. Detailed information is available in Dataset EV6:

Metabolism:
The metabolism category is composed of 12 sub-categories, three of which exclusively contain genes that have homologs in JCVI-syn3.0: ATP synthase, amino acid metabolism, and secretion system. To evaluate the possibility for further reduction or potential alternative genome reduction scenarios, we detail the composition of the remaining nine metabolism sub-categories. Glycolysis and carbohydrate metabolism related proteins that were retained in the reduced genome but not shared with JCVI-syn3.0 are: all beta-glucosidases, both E1 subunits of the PDH, a sucrose-6-phosphate hydrolase, a fructokinase and a 1-phosphofructokinase.
Pentose phosphate metabolism related proteins that were retained in the reduced genome but not shared with JCVI-syn3.0 are: one of the two 2-deoxyribose-5-phosphate aldolase, one of the two pentose-5-phosphate epimerase, as well as the ribokinase.
The only lipid metabolism related protein retained in the reduced genome but not shared with JCVI-syn3.0 is the NAD-dependent glycerol-3-phosphate dehydrogenase.

Genetic Information Processing:
The Genetic Information Processing general category contains 13 sub-categories, five of which contain exclusively genes that have homologs in JCVI-syn3.0: RNA polymerase, Translation factors, Sulfur relay system, Protein export, and tRNA loading and maturation. To evaluate the possibility for further reduction or potential alternative genome reduction, we detail the composition of the remaining eight Genetic Information Processing sub-categories. Since most of the proteins within this general category have homologs in JCVI-syn3.0, we only provide the detail of the discrepancies between the reduced M. florum genome and JCVI-syn3.0.

-Transcription factors
This sub-category contains five proteins, with none specific to M. florum, three having homologs in JCVI-syn3.0, and two homologs in JCVI-syn1.0. The two transcription factors that were retained in the reduced genome but not shared with JCVI-syn3.0 are: an unknown transcriptional regulator and a transcriptional repressor of the fructose operon, DeoR family (consistent with metabolism).

-Ribosome
The Ribosome contains 50 proteins, with a single one specific to M. florum, and the 49 others having homologs in JCVI-syn3.0. The single ribosomal protein that is retained in the reduced genome but not shared with JCVI-syn3.0 is the 50S ribosomal protein L33.
Interestingly, this protein was also absent from JCVI-syn1.0 but was introduced when generating JCVI-syn3A.

-Nucleases:
This sub-category contains eight proteins, with none specific to M. florum, seven having homologs in JCVI-syn3.0, and one homolog in JCVI-syn1.0. The single protein that was retained in the reduced genome but not shared with JCVI-syn3.0 is the Mg2+ dependent DNAse.

-Chaperones
This sub-category contains five proteins, with none specific to M. florum, three having homologs in JCVI-syn3.0, and two homologs in JCVI-syn1.0. The two proteins that were retained in the reduced genome but not shared with JCVI-syn3.0 are the hsp33 redox-regulated chaperone and the cell division trigger factor (EC 5.2.1.8).

-Peptidases
This sub-category contains five proteins, with two specific to M. florum, three having homologs in JCVI-syn3.0, and no homologs in JCVI-syn1.0. The two proteins that were retained in the reduced genome but not shared with JCVI-syn3.0 are the intramembrane