Protein constraints in genome‐scale metabolic models: Data integration, parameter estimation, and prediction of metabolic phenotypes

Genome‐scale metabolic models provide a valuable resource to study metabolism and cell physiology. These models are employed with approaches from the constraint‐based modeling framework to predict metabolic and physiological phenotypes. The prediction performance of genome‐scale metabolic models can be improved by including protein constraints. The resulting protein‐constrained models consider data on turnover numbers (kcat) and facilitate the integration of protein abundances. In this systematic review, we present and discuss the current state‐of‐the‐art regarding the estimation of kinetic parameters used in protein‐constrained models. We also highlight how data‐driven and constraint‐based approaches can aid the estimation of turnover numbers and their usage in improving predictions of cellular phenotypes. Finally, we identify standing challenges in protein‐constrained metabolic models and provide a perspective regarding future approaches to improve the predictive performance.

metabolic flux based on two assumptions: (i) the network is at steady state, whereby there is no change in the concentration of intracellular metabolites, and (ii) cells have evolved to optimize a metabolic objective, such as growth or the production of an important metabolite (Gianchandani et al., 2010). The first assumption results in an underdetermined system of linear equalities arising from mass balance equations, with fluxes as unknowns. By imposing constraints that represent cellular growth conditions, the solution space can be restricted, resulting in more biologically relevant predictions. These constraints include the upper and lower bounds for metabolic fluxes, the (ir)reversibility of reactions, and exchanges with the environment. The second assumption further carves out a part of the feasible space resulting from the application of the constraints and narrows down the physiologically relevant solutions.
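The two assumptions above translate directly into a linear program. A minimal sketch with a toy three-reaction network, where all reaction names and numbers are illustrative and not taken from any published model:

```python
# Toy flux balance analysis (FBA): maximize the flux of a "biomass"
# reaction subject to steady state (S v = 0) and flux bounds.
import numpy as np
from scipy.optimize import linprog

# Stoichiometric matrix S (rows: metabolites A, B; columns: reactions)
# R1: -> A, R2: A -> B, R3: B -> (the objective)
S = np.array([
    [1, -1,  0],   # mass balance of A
    [0,  1, -1],   # mass balance of B
])

bounds = [(0, 10), (0, 1000), (0, 1000)]  # uptake R1 capped at 10

# Maximize v3 (linprog minimizes, hence the sign flip),
# subject to the steady-state constraint S v = 0.
res = linprog(c=[0, 0, -1], A_eq=S, b_eq=np.zeros(2), bounds=bounds)
print(res.x)  # the uptake bound propagates: the whole chain carries flux 10
```

The uptake bound on R1 is the environmental constraint; tightening or relaxing it restricts the feasible space exactly as described above.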
While conventional GEMs provide a useful framework for studying metabolism on a large scale, there are still many phenotypes that elude their predictive capabilities. More specifically, metabolic shifts that occur at higher growth rates cannot be accurately simulated by conventional GEMs, with notable examples including the overflow metabolism in bacteria (Basan et al., 2015), the Crabtree effect in Saccharomyces cerevisiae (Sánchez et al., 2017), and the Warburg effect in human cancer cells (Shlomi et al., 2011) (see Section 3.1 for detailed elaboration). Phenotype predictions of GEMs can be improved by integrating protein constraints related to enzyme kinetic parameters and enzyme concentrations. The turnover number, k_cat, of an enzyme is a key kinetic parameter: a first-order rate constant with the unit of s^-1 that describes the conversion of substrate to product per unit of time, as accelerated by the enzyme. However, despite the advantages and advances in integrating enzyme parameters in GEMs, obtaining k_cat values remains challenging. Measurements of the kcatome, a subset of the kinetome that includes the turnover numbers of all enzymes, depend on the purification of specific enzymes, which is often difficult (Nilsson et al., 2017). Furthermore, there is a lack of knowledge of the cofactors and coenzymes required for enzymatic function, which hinders in vitro measurements of k_cat values (Davidi et al., 2016).
Even when in vitro measurements are available, the usage of the resulting values in protein-constrained GEMs (pcGEMs) is challenging, since the kinetic data are obtained under nonphysiological conditions, which introduces discrepancies with the actual physiological state (Chen & Nielsen, 2021).
Enzyme concentrations are an additional constraint added to GEMs along with turnover numbers. They are derived from absolute protein abundances, usually obtained from quantitative proteomics experiments (Lahtvee et al., 2017; Sánchez et al., 2017). However, like estimates of turnover numbers, absolute proteomics measurements are still difficult to obtain. There are several challenges: (i) a large portion of proteins remains undetected due to limitations of current mass spectrometry techniques (Pappireddi et al., 2019), (ii) ionization efficiency is heavily affected by a protein's physicochemical properties (Otto et al., 2014), (iii) the high cost of equipment and reagents (Swiatly et al., 2018), and (iv) the lack of a standardized approach for measuring absolute abundances (Calderón-Celis et al., 2018). Methods for absolute protein quantification, such as isobaric tagging and stable isotope labeling, have been reviewed by Lindemann et al. (2017). In addition, the reproducibility of absolute protein quantification across different samples is often inconsistent (Millán-Oropeza et al., 2022). In nonmodel species, proteomics studies are further complicated by the lack of physiological and genomic information, especially annotated genomes (Heck & Neely, 2020). Nevertheless, there have been many efforts to use computational approaches to estimate k_cat values and protein abundances, which are explored in the following sections of this review.
This review focuses on pcGEMs, the estimation of the parameters they comprise, and their usage in predicting phenotypes (Figure 1). First, we present the current methods for integrating k_cat values in GEMs, providing a summary of approaches that use pcGEMs and of the phenotypes that can be determined by having access to k_cat and enzyme abundance values. Next, we discuss approaches for the estimation and correction of k_cat values. We then examine approaches for the prediction of protein abundances and their usage in pcGEMs. Finally, we identify and discuss current challenges in parameter estimation and integration in pcGEMs, and provide a perspective on future approaches to improve predictive performance.

| PROTEIN-CONSTRAINED GENOME-SCALE METABOLIC MODELS
The shortcomings of conventional GEMs have propelled the development of approaches to improve their prediction capabilities (Ye et al., 2022). The development and usage of pcGEMs represent one means to overcome these limitations by considering the properties of enzymes that determine reaction fluxes. These pcGEMs inherit all constraints from conventional GEMs, namely the steady-state constraint

S · v = 0,

where S is the stoichiometric matrix and v is the flux distribution vector, and the capacity constraint

v_min ≤ v ≤ v_max,

where v_min and v_max are the lower and upper bounds for metabolic flux, respectively. This allows classic simulation approaches, like FBA (Orth et al., 2010), to also be applicable to pcGEMs. A key difference is the inclusion of the constraint, common to most approaches,

v_j ≤ k_cat^{ij} · [E_i],

where v_j is the metabolic flux of reaction j, [E_i] is the internal concentration of enzyme i, and k_cat^{ij} is the turnover number of reaction j catalyzed by enzyme i (Adadi et al., 2012; Sánchez et al., 2017).
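A minimal sketch of how this extra constraint tightens the toy FBA problem from above; the enzyme concentrations and k_cat values below are made-up illustrations, not measured parameters:

```python
# Sketch: adding the protein constraint v_j <= kcat_ij * [E_i]
# to a toy FBA problem (all numbers are illustrative).
import numpy as np
from scipy.optimize import linprog

S = np.array([[1, -1,  0],
              [0,  1, -1]])       # mass balances of A and B for R1..R3
kcat = {"R2": 50.0, "R3": 20.0}   # s^-1, toy turnover numbers
E = {"R2": 0.1, "R3": 0.1}        # mmol/gDW, toy enzyme concentrations

# v2 <= kcat*E = 5 and v3 <= kcat*E = 2, encoded as A_ub v <= b_ub
A_ub = np.array([[0, 1, 0],
                 [0, 0, 1]])
b_ub = np.array([kcat["R2"] * E["R2"], kcat["R3"] * E["R3"]])

res = linprog(c=[0, 0, -1], A_eq=S, b_eq=[0, 0],
              A_ub=A_ub, b_ub=b_ub, bounds=[(0, 10)] * 3)
print(res.x)  # the objective is now enzyme-limited to kcat*E = 2.0
```

Even though the uptake bound allows a flux of 10, the enzyme capacity of R3 caps the whole pathway at 2.0, which is exactly the mechanism by which pcGEMs capture proteome-limited phenotypes.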
The constraints applied to pcGEMs are straightforward when considering reactions catalyzed by a single enzyme. There are, however, other associations formulated in gene-protein-reaction (GPR) rules, such as isozymes and enzyme complexes; in addition, a gene can participate in multiple GPR rules due to enzyme promiscuity (Amin et al., 2019).
Enzyme complexes can be divided into homo- or heteromeric.
Homomeric enzymes are complexes composed of identical subunits, and enzymatic constraints can be directly defined for such complexes. Heteromeric enzymes, however, are complexes composed of different protein subunits encoded by different genes. This makes the usage of k_cat values challenging, since it can become unclear which subunits contain active sites (Davidi et al., 2016).
Besides pcGEMs, other notable developments to improve conventional GEMs beyond protein allocation consider the whole machinery involved in protein biosynthesis and other cellular processes. These come in two flavors: models of metabolism and macromolecular expression (ME models) and models of resource balance analysis (RBA) (Figure 2). In the former, the processes of transcription and translation are explicitly modeled alongside metabolism (Lloyd et al., 2018). In the latter, macromolecular processes such as secretion and protein folding through chaperones are also taken into account along with protein biosynthesis (Goelzer et al., 2015). These approaches have been recently reviewed by De Becker et al. (2022); Kerkhoven (2022); and Regueira et al. (2021).

FIGURE 1 Timelines for approaches related to protein-constrained GEMs. Three timelines are considered regarding the following problems addressed in protein-constrained GEMs: (i) integration of turnover numbers in GEMs, (ii) estimation of turnover numbers, and (iii) prediction of protein abundances. FBAwMC, flux balance analysis with molecular crowding; GEMs, genome-scale metabolic models; IOMA, integrative omics metabolic analysis; MOMENT, MetabOlic Modeling with Enzyme kineTics.

FIGURE 2 Relationship between different types of GEMs and approaches for their analyses. As pcGEMs improve on conventional GEMs, they embed the constraints from FBA. The ME and RBA models, likewise, envelop pcGEMs by also considering enzyme catalytic rates and enzyme abundances, alongside the additional protein biosynthesis machinery. MFA denotes approaches that rely on stoichiometry in addition to atom mappings to estimate fluxes based on data on labeling patterns. FBA, flux balance analysis; GEMs, genome-scale metabolic models; ME, macromolecular expression; MFA, metabolic flux analysis; pcGEMs, protein-constrained genome-scale metabolic models; RBA, resource balance analysis; SSR, sum of squared residuals.
The specific way in which enzyme constraints are encoded varies with each approach. Two groups of approaches can be identified based on whether turnover numbers and enzyme mass balances are explicitly represented in the stoichiometric matrix. The first group comprises FBAwMC, MOMENT, eMOMENT, and ECMpy, which do not change the original stoichiometric matrix, while the second includes GECKO, sMOMENT, PAM, and OVERLAY, which expand the stoichiometric matrix to encode the protein constraints. The definition and specifics of each approach are discussed in detail in Section 3. If proteomics measurements are available, they can be used to specify the concentrations of enzymes [E_i]. When such measurements are not available, a common assumption is that enzyme usage is limited by the total enzyme pool E_pool, derived from the total protein content, denoting the total usage of all metabolic enzymes integrated in the model (Bekiaris & Klamt, 2020; Sánchez et al., 2017). Another feature of pcGEMs is the splitting of each reversible reaction into two irreversible forward and backward reactions. This way, only positive flux values are calculated, while also making it possible to associate a different k_cat value with each direction, since substrate affinity could differ (Adadi et al., 2012; Davidi et al., 2016).
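The two bookkeeping steps described above, splitting reversible reactions and limiting enzyme usage by a shared pool, can be sketched as follows; the matrix, molecular weights, turnover numbers, and pool size are all illustrative:

```python
# Sketch of two pcGEM preprocessing steps: (1) split a reversible
# reaction into forward/backward columns so all fluxes are positive,
# and (2) check a total enzyme-pool constraint
#     sum_i MW_i * v_i / kcat_i <= E_pool
# used when proteomics data are unavailable. All numbers are toy values.
import numpy as np

S = np.array([[1.0, -1.0],
              [0.0,  1.0]])
reversible = [False, True]          # second reaction is reversible

cols = []
for j in range(S.shape[1]):
    cols.append(S[:, j])            # forward direction
    if reversible[j]:
        cols.append(-S[:, j])       # backward direction, positive flux
S_split = np.column_stack(cols)     # now 3 irreversible columns

MW = np.array([30.0, 40.0, 40.0])   # kDa; both directions share the enzyme
kcat = np.array([10.0, 5.0, 2.0])   # s^-1; directions may get different kcat
v = np.array([1.0, 0.5, 0.0])       # a candidate flux distribution
E_pool = 10.0

pool_usage = (MW / kcat * v).sum()
print(S_split.shape, pool_usage <= E_pool)
```

Note that the backward column reuses the enzyme's molecular weight but may carry its own k_cat, mirroring the direction-specific turnover numbers mentioned above.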

| INTEGRATION OF TURNOVER NUMBERS IN GEMS
Over the years, many methods have been developed to integrate enzymatic information in GEMs. The first method to consider enzyme parameters was FBA with molecular crowding (FBAwMC), which uses information about the solvent capacity for attainable enzyme concentrations inside the cytoplasm (Beg et al., 2007). The optimization problem solved by FBAwMC is to maximize the growth rate subject to the FBA constraints plus limitations expressed by crowding coefficients a_i. This is attained by constraints that involve the cell volume V, such that

V ≥ Σ_i ν_i · n_i,

where ν_i is the molar volume of enzyme i and n_i is the level of enzyme i. This equation is reformulated by dividing by the cell mass M, which gives e_i = n_i / M, the enzyme concentrations, and C = M / V, the cytoplasmic density. By considering that the flux v_i is defined as

v_i = k_i · b_i(x_i) · e_i,

with x_i being the concentration of substrates, products, activators, and inhibitors associated with reaction i, and k_i being the k_cat for enzyme i, the constraint applied to simulations is then defined as

Σ_i a_i · v_i ≤ 1, with a_i = C · ν_i / (b_i · k_i).

Another method, termed IOMA (integrative omics metabolic analysis), considered enzyme turnover numbers along with proteomics and metabolomics data (Yizhak et al., 2010). IOMA relies on a Michaelis-Menten-like rate equation to estimate flux distributions. It considers relative protein levels and enzyme kinetics information, such as k_cat and V_max values. In this way, the following constraint is added to the classic FBA constraints:

v = (e / e_ref) · (V_max^+ · a^+ − V_max^− · a^−),

where e is the concentration of enzymes, e_ref is the concentration of enzymes at a reference condition, a^+ and a^− are the saturation values for enzymes in forward and reverse reactions, respectively, and V_max^+ and V_max^− are the maximal fluxes for forward and reverse reactions, respectively.
The saturation values for enzymes are calculated as

a^+ = (s_i / k_m,s_i) / (1 + s_i / k_m,s_i + p_i / k_m,p_i)   and   a^− = (p_i / k_m,p_i) / (1 + s_i / k_m,s_i + p_i / k_m,p_i),

where s_i and p_i are, respectively, the concentrations of substrates and products i, and k_m,s_i and k_m,p_i are the dissociation constants for substrates and products, respectively. These previously described methods provided an improvement over FBA but, as pointed out by Adadi et al. (2012), they depended on utilizing experimentally determined uptake rates to predict phenotypes across different conditions. To address this issue, Adadi et al. (2012) developed MetabOlic Modeling with Enzyme kineTics (MOMENT), which expands the FBAwMC approach by taking into account the maximal cellular capacity of enzymes. The MOMENT approach also improved the handling of isozymes and enzyme complexes. A reaction r catalyzed by a single enzyme encoded by gene a is constrained as

v_r ≤ k_cat^{r,a} · g_a,

where g_a is the concentration of the gene product a. However, for reactions that can be catalyzed by two enzymes a or b, the equation changes to

v_r ≤ k_cat^{r,a} · g_a + k_cat^{r,b} · g_b.

For reactions catalyzed by enzyme complexes, the formulation changes to

v_r ≤ k_cat^r · min(g_a, g_b).

Similar to FBAwMC, MOMENT also uses a constraint on the enzyme solvent capacity:

Σ_i g_i · MW_i ≤ C,

where MW_i is the molecular weight of protein i and C is the total weight of proteins.
Improvements and derivations of the MOMENT approach have been developed recently. Bekiaris and Klamt (2020) have extended and simplified MOMENT, resulting in sMOMENT (short MOMENT).
In this approach, the same constraints of MOMENT are used, but the resulting model has significantly fewer variables while yielding the same results as MOMENT. This is achieved by reformulating the enzyme solvent capacity constraint, such that

Σ_i (MW_i / k_cat^i) · v_i ≤ P,

where P is the threshold (g/gDW) of enzymes covered by the pcGEM.
This equation is further reformulated as

Σ_i (MW_i / k_cat^i) · v_i = v_pool ≤ P,

where v_pool represents the mass of all enzymes in the pcGEM needed to catalyze the reactions in the model. The sMOMENT approach also changes how enzyme complexes are handled by the model. By considering the enzyme costs c_i of a reaction i, such that

c_i = MW_i / k_cat^i,

then for all enzymes catalyzing reaction i, or all subunits of an enzyme complex, the minimum value is used:

c_i = min_j (MW_j / k_cat^{ij}).

Another extension to MOMENT is the approach termed eMOMENT, which introduces enzyme promiscuity as a new constraint to the base MOMENT formulation (Wendering & Nikoloski, 2022). This constraint is defined as

Σ_{r ∈ R(k)} v_r / k_cat^{kr} ≤ E_k for every gene k ∈ G,

where G is the set of genes in the model, E_k is the abundance of enzyme k, and R(k) is the set of reactions in the model catalyzed by the product of gene k.
An approach inspired by the reformulations and reduction of model complexity introduced by sMOMENT has been developed by Mao et al. (2022), termed ECMpy. This approach introduces enzyme constraints in the model without adding new reactions or explicitly accounting for enzymes in the stoichiometric matrix. Instead, it uses a single constraint, defined as

Σ_i (v_i · MW_i) / (σ_i · k_cat^i) ≤ p_tot · f,

where σ_i is the saturation coefficient of enzyme i, p_tot is the total protein content in the model, and f is the mass fraction of enzymes.
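A sketch of evaluating this single aggregate constraint for a candidate flux distribution; all values are invented and the units are chosen only for internal consistency, not taken from ECMpy:

```python
# Sketch of an ECMpy-style aggregate constraint: the enzyme mass implied
# by each flux, v_i * MW_i / (sigma_i * kcat_i), summed over reactions,
# must not exceed p_tot * f. All numbers are illustrative.
import numpy as np

v = np.array([0.1, 0.05, 0.01])       # mmol/gDW/h, candidate fluxes
MW = np.array([40.0, 60.0, 120.0])    # g/mmol, toy molecular weights
kcat = np.array([180.0, 90.0, 30.0])  # h^-1 (per-hour for unit consistency)
sigma = 0.5                           # toy average saturation coefficient
p_tot, f = 0.56, 0.4                  # g/gDW total protein, enzyme fraction

enzyme_mass = (v * MW / (sigma * kcat)).sum()   # g/gDW tied up in enzymes
print(enzyme_mass, enzyme_mass <= p_tot * f)
```

Because no pseudoreactions or pseudometabolites are introduced, this check can be bolted onto an existing model as a single extra inequality, which is the source of ECMpy's compactness.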
An approach similar to MOMENT, termed GEM with Enzymatic Constraints using Kinetic and Omics data (GECKO), also integrates k_cat values to limit metabolic flux according to its maximum capacity (Sánchez et al., 2017). GECKO also limits reactions according to the abundance of each enzyme present in the model, given that absolute proteomics measurements are available. The GECKO approach uses the same constraint as MOMENT for reactions catalyzed by single enzymes. It differs in its handling of isozymes, promiscuous enzymes (not considered in MOMENT), and enzyme complexes. For isozymes, the reaction is split into as many different reactions as there are isozymes capable of catalyzing it, each with only one enzyme. Then, an intermediate reaction termed "arm reaction" is added to keep the original upper bound of the reaction, using a pseudometabolite representing an intermediate state between substrate and product. This allows the enzyme of each reaction to be assigned a different k_cat value, as substrate affinity could differ. For reactions catalyzed by a promiscuous enzyme i, the summed usage across all reactions j it catalyzes is limited by its concentration:

Σ_j v_j / k_cat^{ij} ≤ [E_i].

This arrangement also allows considering different k_cat values for each reaction catalyzed by the same enzyme. In addition, these reactions share the same upper bound of enzyme availability. Finally, for reactions catalyzed by enzyme complexes, the stoichiometry s_ik of the enzyme subunit U_ik and its concentration [U_ik] are considered, constraining each subunit's usage as

s_ik · v_j / k_cat^j ≤ [U_ik].

The GECKO approach has also been improved upon. A method developed by Alter et al. (2021), termed the protein allocation model (PAM), uses the base formulation of GECKO for integrating k_cat values and proteomics data and reimplements the idea of protein allocation sectors as developed by Mori et al.
(2016) in the approach termed constraint allocation FBA (CAFBA). In CAFBA, the total enzyme usage of the model is divided into four sectors: ribosomal proteins, biosynthetic proteins, uptake and transport proteins, and housekeeping proteins. The PAM approach instead divides the total enzyme usage into three sectors: translational proteins, which are enzymes related to protein biosynthesis and are directly associated with the growth rate; unused enzymes, defined as those enzymes that exist in an overabundance state, where more proteins were produced than was necessary for the cell during a certain
physiological state, which reduces the growth rate; and active enzymes, those actively involved in catalyzing reactions in metabolism. The translational protein sector Φ_T is defined as

Φ_T = Φ_T,0 + μ / w_T,

where Φ_T,0 is the translational enzyme concentration at zero growth, w_T is the maximum ribosomal elongation rate, and μ is the growth rate. The unused enzyme sector Φ_UE is expressed as

Φ_UE = Φ_UE,0 − w_UE · v_s,

where Φ_UE,0 is the unused enzyme concentration at zero growth, w_UE is a measure of the increase in enzyme usage efficiency, and v_s is the substrate uptake rate. The active enzyme sector Φ_AE has to account for the enzyme mass balances and k_cat values of all enzymes that catalyze metabolic reactions. It is defined as

Φ_AE = Σ_e (v_e · M_e) / k_cat^e,

where v_e is the flux of a reaction catalyzed by enzyme e and M_e is the molar mass of the enzyme. Thus, the total enzyme mass concentration of the model can be described as the sum of all protein sectors:

Φ_P = Φ_T + Φ_UE + Φ_AE.

An approach termed OVERLAY proposes a different formulation (Yao et al., 2023). It integrates catalytic rates in the form of the effective turnover rate, k_eff. In contrast to other approaches, it considers enzyme complexes separately from other enzymes, treating complexes as single entities in the model. Further, for each reaction catalyzed by an enzyme, OVERLAY adds a pair of forward and reverse enzyme species to account for reversible reactions, while spontaneous reactions are ignored. In terms of model constraints, the k_eff values are used to define the lower and upper bounds of a reaction flux v,

−k_eff^avg · e_rev ≤ v ≤ k_eff^avg · e_for,

where k_eff^avg is a basal value of k_eff (assumed to be 65 s^−1), e_rev is the enzyme concentration for the reverse reaction, and e_for is the enzyme concentration for the forward reaction.
The described approaches vary widely in how they are implemented, how much of the reconstruction steps are automated, and how easy it is to navigate their documentation. Focusing on the approaches available in public repositories (e.g., GitHub), the sMOMENT approach is bundled with a workflow termed AutoPACMEN, a Python package that allows for the automated reconstruction of pcGEMs. A step-by-step tutorial is included in the supplementary information of the original manuscript, with more detailed documentation provided in the Python package manual. The ECMpy approach is also implemented in Python, with its main functions contained in a single script. A step-by-step guide is available in the form of Jupyter notebooks, which reproduce the reconstruction of eciML1515 as performed in its manuscript. The GECKO approach, on the other hand, is available as a MATLAB package. It supports the reconstruction and refinement of pcGEMs, and the integration of proteomics measurements. The documentation for the GECKO approach is not extensive, but some information is included as comments in its main functions. GECKO, while mainly developed for MATLAB, also includes a Python package to integrate protein abundances and to interface with cobrapy (Ebrahim et al., 2013). It is important to highlight that, since sMOMENT, ECMpy, and GECKO follow similar formulations, they generate very similar models with equivalent phenotype prediction capabilities. GECKO models, however, are notably more complex than sMOMENT or ECMpy models, since GECKO explicitly introduces enzyme usage pseudoreactions and pseudometabolites to encode the enzyme constraints. The PAM approach generates models similar to GECKO models, but with the addition of proteome sectors.
It is also available as a MATLAB package. It currently lacks detailed documentation on how to execute the tool, but example code is provided. Finally, the OVERLAY approach is also developed for MATLAB.

Overflow metabolism is also known as the Crabtree effect and the Warburg effect in the context of yeast and human cancer cells, respectively, and occurs when carbon source availability exceeds the capability of an organism to assimilate it (Li, Nees et al., 2022).

| Phenotypes that can be determined having access to k_cat values
A metabolic phenomenon often missed by conventional GEMs is diauxic growth. In a setting where multiple carbon sources are available, conventional FBA predicts the simultaneous uptake and usage of all carbon sources, which is biologically unrealistic. With the FBAwMC approach, the Escherichia coli model MG1655 was simulated in a condition where five different carbon sources were available: glucose, galactose, maltose, glycerol, and lactate. In this setup, the sequence of substrate uptake and consumption matched experimental data, with glucose being used first and exclusively, followed by galactose, lactate, maltose, and glycerol (Beg et al., 2007).
The Crabtree effect in S. cerevisiae was captured in the pcGEM ecYeast7 by simulating a glucose-limited chemostat with increasing growth rates. As the growth rate increased, there was a linear increase in the uptake of glucose and O2 and in the production of CO2. At a growth rate of 0.3 h^−1, the uptake of glucose and the production of ethanol sharply spiked, while O2 consumption sharply decreased.
However, the conventional GEM Yeast7 still predicted a linear increase in the uptake of glucose and O2 and in the production of CO2 (see fig. 3A from Sánchez et al., 2017). In terms of pathway usage, the pcGEM predicted an increase in metabolic flux through glycolysis, while the flux through oxidative phosphorylation decreased (Sánchez et al., 2017).
The GECKO approach has also been combined with dynamic FBA (dFBA) (Mahadevan et al., 2002) to simulate metabolic phenotypes over time.

The maximum rate of a reaction, V_max, can be determined by knowing the catalytic rate of the reaction when the enzyme is at its point of saturation and how much of that enzyme is present, given the relationship

V_max = k_cat · [E].

As this represents the maximum rate, the effects of metabolites can only result in metabolic fluxes lower than V_max. This is accounted for by considering a function η that captures the effect of metabolite concentrations and different parameters (e.g., equilibrium and Michaelis-Menten constants, K_eq and K_m). For an environmental condition C, the function η satisfies

0 ≤ η(C) ≤ 1,

allowing the metabolic flux to be expressed as

v = η(C) · k_cat · [E].

If measurements of the flux v and the enzyme concentration [E] are available, it is possible to calculate the apparent catalytic rate k_app by rearranging this expression,

k_app = v / [E],

or by considering the relationship between k_cat and η:

k_app = η(C) · k_cat.

This relation allows us to derive one of the three quantities if the other two are available. It was first used by Valgepea et al. (2013) to calculate k_app values using quantifications of the absolute proteome and metabolic flux analysis of E. coli cultivated at increasing growth rates. They calculated k_app values for 191 enzymes and found that, as the growth rate increases, there is a 3.7-fold increase in k_app values, which is discussed as a possible mechanism by which metabolic flux increases alongside growth rates.
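The k_app calculation above is a simple element-wise division once fluxes and abundances are paired up; a sketch with made-up data for two enzymes measured in three growth conditions:

```python
# Sketch: apparent catalytic rates k_app = v / [E] from paired flux and
# enzyme-abundance measurements across conditions, plus the per-enzyme
# maximum across conditions. All data are invented for illustration.
import numpy as np

# rows: growth conditions; columns: enzymes
flux = np.array([[1.0, 0.4],
                 [2.0, 0.5],
                 [4.0, 0.6]])        # mmol/gDW/h
abundance = np.array([[0.02, 0.01],
                      [0.02, 0.01],
                      [0.04, 0.01]]) # mmol/gDW

k_app = flux / abundance             # h^-1, condition-specific
k_app_max = k_app.max(axis=0)        # per-enzyme maximum across conditions
print(k_app, k_app_max)
```

Since k_app = η(C) · k_cat with η ≤ 1, every measured k_app is a lower bound on k_cat, which is why the maximum across many conditions is used as an in vivo proxy for the turnover number.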

| Estimation of catalytic rates
The estimation of in vivo catalytic rates by Davidi et al. (2016) builds on the approach by Valgepea et al. (2013) and laid much of the groundwork in estimating the kcatome that following approaches later employed, by exploring the relationship

k_app = v / [E],

with the maximum k_app value of an enzyme across conditions serving as a proxy for its in vivo k_cat. It was also demonstrated by Davidi et al. (2016) that these maximal apparent catalytic rates correlate well with in vitro k_cat measurements.

The next development on the estimation of catalytic rates after NIDLE-flux is a novel constraint-based approach developed by Wilken et al. (2022). The approach relies on an objective function that minimizes the error between predicted and measured fluxes and enzyme concentrations, such that

L(k_cat) = Σ_{i ∈ metabolic reactions} (v_i − v̂_i)² + Σ_{j ∈ enzymes} (e_j − ê_j)²,

where L is the error function, v_i is the flux through reaction i, e_j is the concentration of enzyme j, and hatted quantities denote measurements. This optimization problem allows for the prediction of both metabolic fluxes and enzyme concentrations (both are also unknown variables in the function L(k_cat), together with the molecular weights of proteins, albeit not shown explicitly in the manuscript); given the relationship between these variables, it is possible to calculate the k_cat values that best fit the experimental data. Wilken et al. (2022) tested this optimization problem using experimental data from Heckmann et al. (2020), predicting metabolic fluxes and protein concentrations for a diverse range of growth conditions. They found that using k_cat values estimated with this approach improves the accuracy of model predictions against experimental data by 35 ± 2%. However, the approach in its present formulation includes quadratic constraints, which calls for further exploration of how to identify the global optimum of the optimized function. Another data-driven method is the deep learning approach DLKcat, developed by Li, Yuan et al.
(2022). This method uses a combination of a graph neural network, taking as input features derived from substrates in the SMILES format, and a convolutional neural network, using amino acid sequences as inputs. The networks also considered the substrate name, EC number, organism name, and k_cat values. The DLKcat method thus requires significantly fewer inputs than the model developed by Heckmann et al. (2018). It was developed using the Python package PyTorch (Paszke et al., 2019). DLKcat was able to predict k_cat values for all enzymes of 343 fungal species, all of which were used for the reconstruction of pcGEMs. A sequential Monte-Carlo-based approximate Bayesian computation was also used to correct in vivo k_cat estimates when these were significantly different from in vitro k_cat values. For assessing the predictions, RMSE values were calculated between experimental and predicted growth in S. cerevisiae and Yarrowia lipolytica, achieving lower values in each generation of the Bayesian training process and outperforming the original pcGEMs. The predictions obtained using DLKcat were made available to the public in an extensive database named GotEnzymes (Li et al., 2023), available at https://metabolicatlas.org/gotenzymes, containing over 25.7 million enzyme-substrate pairs for over 8000 organisms. DLKcat itself is available in a GitHub repository and contains a step-by-step guide for the user to install the dependencies and run the trained models from a command-line script.
While the DLKcat method proposes a powerful approach to predict k_cat values, some shortcomings have been identified by other works; for instance, DLKcat accurately predicts only k_cat values of enzymes similar to those used in the training data set, with decreasing accuracy for enzymes whose amino acid sequences are more dissimilar to those found in the training data (Kroll et al., 2023). On this point, Kroll et al. (2023) proposed a new method, termed TurNuP, to predict k_cat values using a modified and retrained transformer network. The model was trained using amino acid sequences, substrate and product IDs, reaction equations, and k_cat measurements, all collected from BRENDA, UniProt, and Sabio-RK.
The information was transformed into binary molecular fingerprints for each substrate and each product to integrate the data. TurNuP was able to predict k_cat values in good agreement with experimental data, achieving a Pearson correlation of 0.67. For unseen reactions, which were not present in the original data set, the performance was still good, achieving a Pearson correlation of 0.60. Kroll et al. (2023) also evaluated how sequence similarity between enzymes in the training and test data sets affects model performance. For enzymes with high sequence similarity (99%-100%), the model achieves an R² score of 0.67, while for enzymes with low sequence similarity (0%-40%), the R² score was 0.33. As with previous tools, TurNuP was developed using PyTorch. It is available in a GitHub repository, along with all data sets deposited on Zenodo. Furthermore, TurNuP is available on a web server (https://turnup.cs.hhu.de/), requiring no previous setup. It is important to highlight, however, that the described approaches were mostly focused on estimating in vivo maximal catalytic rates (k_max^vivo) for enzymes participating in reactions with simple GPR rules, for example, a single enzyme catalyzing a single reaction. For isozymes, however, it is difficult to determine k_max^vivo values, as the kinetics of each enzyme might differ. This is tackled by Davidi et al. (2016) by treating the isozymes as one single enzyme, using the sum of the molecular weights of the isozymes as the molecular weight of the lumped enzyme. In contrast, Xu et al. (2021) use the maximum abundance for isozymes and the minimum abundance for enzyme complexes.
Meanwhile, Heckmann et al. (2018) do not consider complex GPR rules in their approach at all. This challenge is more pronounced for enzyme complexes, as few studies to date have described a possible way to assess their enzyme kinetics, with many focusing only on homomeric complexes. Homomeric complexes can be treated as a single element and thus assume only one catalytic rate value, but at the cost of forfeiting the specific mapping of enzyme data to proteins and their reactions. For heteromeric enzymes, Davidi et al. (2016) have proposed using the specific activity (SA) of the heteromeric complex instead of the k_cat value. The SA is defined as the amount of product (in molar quantity) that is formed in a reaction per weight of enzyme per unit of time. It is calculated as

SA = k_cat / MW,

where MW is the molecular weight of the complex.

| Correction of catalytic rates
The integration of k_cat and k_app values obtained from either biochemical assays or computational estimations can often lead to overconstrained models. This happens due to the inherent errors and uncertainties of these measurements. In the first iteration of the GECKO approach, k_cat values retrieved from the BRENDA database were manually curated to ensure that the model was able to generate feasible solutions (Sánchez et al., 2017). The second iteration of GECKO (Domenzain et al., 2022) introduced a heuristic for the correction of k_cat values that is based on the enzyme control coefficient (ECC), defined as

ECC_ij = Δv_obj / Δk_cat^{ij},

where v_obj is the solution for a given objective function, Δk_cat^{ij} is a perturbation of the k_cat value induced by an increase of its initial value by 10-fold, and Δv_obj is the resulting change in v_obj. The ECCs are then used to rank the enzymes in the pcGEM in decreasing order, with the first enzyme in the list selected to have its k_cat value changed to the maximum k_cat value that exists in BRENDA. This operation iterates until the pcGEM can achieve the experimental growth rate provided when reconstructing the pcGEM.
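The correction loop can be sketched as follows; here the "model" is a toy function standing in for a pcGEM simulation, and all k_cat values, abundances, and the target growth rate are invented:

```python
# Sketch of a GECKO 2.0-style correction loop: perturb each kcat 10-fold,
# rank enzymes by the resulting change in the objective (ECC), and raise
# the top-ranked enzyme's kcat to a database maximum, iterating until the
# measured growth rate is reached. The "model" is a toy stand-in.
import numpy as np

def growth(kcat, E):
    # toy objective: growth limited by the slowest enzyme capacity
    return min(k * e for k, e in zip(kcat, E))

kcat = [10.0, 50.0, 80.0]           # current model values (toy)
kcat_db_max = [200.0, 60.0, 90.0]   # maximal database values (toy)
E = [0.01, 0.01, 0.01]              # enzyme abundances (toy)
target_growth = 0.5                 # measured growth rate (toy)

while growth(kcat, E) < target_growth:
    base = growth(kcat, E)
    # enzyme control coefficients under a 10-fold kcat perturbation
    eccs = []
    for i in range(len(kcat)):
        perturbed = kcat.copy()
        perturbed[i] *= 10
        eccs.append(growth(perturbed, E) - base)
    top = int(np.argmax(eccs))
    if kcat[top] >= kcat_db_max[top]:
        break                       # nothing left to correct
    kcat[top] = kcat_db_max[top]

print(kcat, growth(kcat, E))
```

In this toy run only the first enzyme controls growth, so a single correction suffices; in a real pcGEM the ranking is recomputed each iteration because relieving one bottleneck shifts control to other enzymes.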
This kcat correction heuristic was assessed by reconstructing a new version of the ecYeast7 model using GECKO 2.0 and comparing it to the first version, which was reconstructed with GECKO 1.0.
Simulations performed with the model reconstructed with GECKO 2.0 had a lower average relative error than the model reconstructed with GECKO 1.0 when compared to experimental data in 19 different conditions (23.97% versus 32.07% average relative error, respectively).
An algorithm for correcting kcat values named PRESTO (protein-abundance-based correction of turnover numbers) has been proposed by Wendering et al. (2022); it leverages measurements of protein abundance and exchange fluxes over multiple conditions to correct kcat values. Instead of control coefficients, PRESTO uses a linear optimization approach. A nonnegative correction factor δ_i is added to the kcat value of each enzyme i for which protein abundance values are available. The objective of the optimization problem is to minimize a weighted linear combination of the average relative error for predicted specific growth rates and the total correction of the initial turnover numbers integrated in the pcGEM,

min (1/|C|) Σ_C ω_C + λ Σ_i δ_i,

where ω_C is the relative error between experimental and predicted growth rates in an experimental condition C and λ is a fitted parameter that controls the trade-off between the two minimization objectives. To validate the approach, the corrected kcat values were integrated in the S. cerevisiae GEM Yeast8 and the E. coli GEM iML1515. The resulting pcGEMs were then used to simulate phenotypes in three settings: (i) only the total protein content of a given growth condition was available, (ii) uptake constraints were considered alongside total protein content, and (iii) measured protein abundances were also integrated. The models integrated with PRESTO-corrected kcat values displayed lower relative errors in all three settings than the models integrated with GECKO-corrected kcat values, using the pcGEMs ecYeast8 and eciML1515 generated with GECKO 2.0. These findings highlight that using physiological data and enzyme abundance can enhance estimations of kcat values. The obtained estimations and corrections of catalytic rates can already greatly enhance the predictive capabilities of pcGEMs. The support of some approaches (e.g., GECKO) for the integration of proteomics data can further improve these capabilities.
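A schematic version of such a correction can be posed as a linear program. In this toy instance, one enzyme and two conditions with hypothetical numbers, and the predicted growth rate simplified to (kcat + δ)·E, the variables are the correction δ and the per-condition relative errors ω_C:

```python
import numpy as np
from scipy.optimize import linprog

kcat0 = 1.0                   # initial turnover number
E = [2.0, 4.0]                # enzyme abundances in conditions C1, C2
mu = [10.0, 10.0]             # experimental growth rates
lam = 0.01                    # trade-off parameter lambda

# variables x = [delta, w1, w2]; objective: lam*delta + average relative error
c = np.array([lam, 0.5, 0.5])
A_ub = np.array([
    [-E[0] / mu[0], -1.0,  0.0],  # w1 >= (mu1 - (kcat0+delta)*E1)/mu1
    [ E[0] / mu[0], -1.0,  0.0],  # w1 >= ((kcat0+delta)*E1 - mu1)/mu1
    [-E[1] / mu[1],  0.0, -1.0],
    [ E[1] / mu[1],  0.0, -1.0],
])
b_ub = np.array([
    E[0] * kcat0 / mu[0] - 1.0,
    1.0 - E[0] * kcat0 / mu[0],
    E[1] * kcat0 / mu[1] - 1.0,
    1.0 - E[1] * kcat0 / mu[1],
])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3)
delta = res.x[0]
print(round(delta, 3))  # 1.5 -> corrected kcat = kcat0 + 1.5 = 2.5
```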

| APPROACHES FOR PREDICTION OF PROTEIN ABUNDANCE
Absolute protein abundances allow for the integration of constraints from proteomics data in pcGEMs (Adadi et al., 2012; Sánchez et al., 2017). These data are usually derived from mass spectrometry using methods such as spectral counting or peptide intensity-based quantification (Lindemann et al., 2017). However, given the limitations of these methods, many computational approaches have been developed to predict absolute protein abundance from data that are easier to obtain. Many of these approaches are based on transcriptomics data, sequence-derived data, physicochemical data, or a combination thereof. These approaches are usually developed using statistical or supervised learning models, but there are also approaches that stem from the constraint-based modeling framework.

| Predictions using data-driven models
One of the earliest attempts at a data-driven model to predict protein abundance was the work of Nie et al. (2006), who developed a zero-inflated Poisson (ZIP) regression model integrating microarray and relative protein abundance data. The data were obtained experimentally by growing Desulfovibrio vulgaris on lactate or formate as carbon sources. To build the model, it was assumed that protein abundance y follows a Poisson distribution with probability 1 − p and mean λ, and depends on messenger RNA (mRNA) abundance x.
The model is then defined as

P(y) = p δ + (1 − p) e^(−λ) λ^y / y!,

where δ is 1 if y = 0 and 0 otherwise. To evaluate the model, they calculated the coefficient of variation (CV) of 30 ribosomal proteins and seven subunits of ATP synthase and compared it to the CV values of the entire D. vulgaris set of proteins.
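The zero-inflated Poisson likelihood above can be written out directly (a sketch; p and λ here are arbitrary illustrative values rather than fitted regression outputs):

```python
import math

def zip_pmf(y, p, lam):
    """P(y) = p*[y == 0] + (1 - p)*Poisson(y; lam): excess zeros plus a Poisson core."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    return (p if y == 0 else 0.0) + (1 - p) * poisson

p, lam = 0.3, 2.5  # illustrative zero-inflation probability and Poisson mean
total = sum(zip_pmf(y, p, lam) for y in range(60))
print(round(total, 6))                      # 1.0: a valid probability distribution
print(zip_pmf(0, p, lam) > math.exp(-lam))  # True: more zeros than a plain Poisson
```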
While the ZIP model could generate predictions with lower CV values for some proteins related to central metabolic pathways or certain operons (Nie et al., 2006), the model relied solely on mRNA and protein data, assuming a linear relationship between the two, which limits its predictive capabilities. Torres-García et al. (2009) instead proposed a nonlinear approach, using gradient boosted trees (GBT) and a data set composed of the microarray and proteomics data from Nie et al. (2006), numerical sequence-derived data such as protein length, molecular weight, GC content, and codon composition, along with categorical data containing the functional category of each protein. Model performance was assessed by the coefficient of determination, which ranged from 0.393 to 0.582, and by the coefficient of variation, which was smaller than that of the model of Nie et al. (2006).
Although the GBT model provided an improvement over the ZIP model, the coefficient of determination was still low. To address this problem, Li et al. (2011) developed a model using a multilayer perceptron (MLP) trained on the data set constructed by Torres-García et al. (2009). The resulting MLP is a feed-forward network that uses a hyperbolic tangent activation function. It has an input layer that takes the transcriptomics data, a single hidden layer with six to nine neurons (depending on the data set), and an output layer that yields the protein abundance values. The coefficient of determination of the trained model ranged from 0.47 to 0.69, an improvement over the previous GBT attempt.
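The architecture can be sketched in a few lines (random weights stand in for the trained ones, and the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, W1, b1, W2, b2):
    """Feed-forward pass: input -> tanh hidden layer -> linear output."""
    hidden = np.tanh(x @ W1 + b1)
    return hidden @ W2 + b2

n_features, n_hidden = 4, 8           # e.g., transcript-derived features, 6-9 hidden units
W1 = rng.normal(size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, 1))   # single output: predicted protein abundance
b2 = np.zeros(1)

x = rng.normal(size=(5, n_features))  # feature vectors for five genes
y_pred = mlp_forward(x, W1, b1, W2, b2)
print(y_pred.shape)  # (5, 1)
```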
Another data-driven model to predict protein abundances is the Bayesian network constructed by Mehdi et al. (2014), which combined data from S. cerevisiae and S. pombe consisting of mRNA expression levels, transfer RNA adaptation index, protein and mRNA half-lives, mRNA folding energy, and mRNA interactions with RNA-binding proteins. Model performance was assessed by Spearman's correlation, which ranged from 0.61 to 0.77. Joint learning approaches have also been used to predict protein abundances. Li et al. (2019) developed an integrated approach to predict protein abundance in breast and ovarian cancer cells. The devised approach contains three parts. First, a generic model learns the relationship between mRNA expression and protein abundance.
Then, a series of protein-specific random forest models learn how individual genes behave in a network. Finally, a cross-sample model uses the combined data of the two cancer types and is trained in the same way as the protein-specific random forest models. An ensemble of the three parts was then trained, using the weighted average of the predictions from each part to predict protein abundance. This approach outperformed the other approaches in the NCI-CPTAC DREAM Proteogenomics Challenge, and its predictions achieved an average Pearson correlation of 0.53.
Besides mRNA expression, mRNA secondary structures have also been used to predict protein abundances. Terai and Asai (2020) developed RBSeval, an approach trained with three different algorithms to predict protein abundance in E. coli, using features such as accessibility around the Shine-Dalgarno sequence, minimum free energy of the mRNA molecule, the Viterbi score, and the inside-outside score, these being calculated by either the Turner model or the CONTRAfold model. The model was assessed by Spearman's correlation, which ranged from 0.554 to 0.709.
The data-driven approaches described so far rely heavily on experimental data for training, which can become troublesome if these approaches are to be applied to organisms different from those used in the original manuscripts. In this regard, Ferreira et al. (2021) trained an AdaBoost regression model to predict protein abundances using codon usage metrics as features. The model was trained on S. cerevisiae proteomics data, with predictions achieving a Spearman's correlation of 0.744 when compared to experimental data. The model was then used to predict protein abundance for E. coli, S. pombe, and Kluyveromyces marxianus, achieving Spearman's correlations of 0.503, 0.702, and 0.623, respectively. Predictions for S. cerevisiae were also assessed by integrating the predicted protein abundances in the ecYeast8 pcGEM, which yielded metabolic flux simulations in agreement with simulations performed using experimental proteomics data.
While the approach of Ferreira et al. (2021) and the previous models achieved good predictions, they still relied on data from optimal growth conditions. In addition, given that the proteome is remodeled when physiological and/or environmental changes occur, machine learning models that rely on static features (such as codon usage, macromolecular function and structure, and housekeeping gene expression) cannot capture the dynamic nature of the proteome. Furthermore, current constraint-based methods underestimate protein abundance (see Sections 5.2 and 6.1). To address these issues, an integrated framework of constraint-based modeling with machine learning, named CAMEL (Coupled Approach of MEtabolic modeling and machine Learning), was recently developed (Moura Ferreira et al., 2023). The constraint-based module of CAMEL predicts the enzyme usage distribution and the flux distribution under a given growth condition. The predicted enzyme usage distribution is employed along with experimental proteomics measurements to calculate the protein reserve ratio, which quantifies the discrepancy between measured and predicted protein abundances. The protein reserve ratio is then used to train machine learning models with the tree-based pipeline optimization tool (TPOT; Olson & Moore, 2019), using as features the enzyme usage and flux distributions and codon usage metrics calculated from coding sequences. By employing the predicted protein reserve ratios, it is possible to calculate in vivo protein abundances that match the experimental proteomics measurements, since both the ratio relative to the predictions and the predictions themselves are known. For E. coli, the CAMEL-calculated in vivo protein abundances achieved a Pearson correlation with experimental proteomics measurements of over 0.9, while for S. cerevisiae, the correlation was 0.5.
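The reserve-ratio bookkeeping can be illustrated as follows (the ratio is assumed here to be measured over predicted abundance, and all numbers are hypothetical):

```python
def protein_reserve_ratio(measured, predicted):
    """Discrepancy between measured and pcGEM-predicted abundance (assumed as a ratio)."""
    return measured / predicted

def in_vivo_abundance(predicted, predicted_ratio):
    """Recover an in vivo abundance estimate from a model prediction and a predicted ratio."""
    return predicted * predicted_ratio

# training phase: ratios computed from proteomics plus model predictions
ratio = protein_reserve_ratio(measured=4.0, predicted=1.6)  # enzyme used at 40% capacity
# application phase: a trained ML model would supply the ratio for a new condition
estimate = in_vivo_abundance(predicted=2.0, predicted_ratio=ratio)
print(ratio, estimate)  # 2.5 5.0
```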
The described data-driven models also vary widely in how they were developed, how accessible their source code is, and how accessible their documentation is. The joint learning approach of Li et al. (2019) has its entire code and data sets available on GitHub, with basic instructions to reproduce the analysis of the manuscript. The software tool RBSeval is also available on GitHub and is implemented as a command-line tool, requiring as input only a FASTA file of coding sequences. However, since it depends on features only available in prokaryotes (e.g., the Shine-Dalgarno sequence), it is not applicable to data from eukaryotic organisms. The code and data for the predictive models of Ferreira et al. (2021) are also available on GitHub, including code for reproducing the analysis in the manuscript. The CAMEL approach likewise has its code and data available in a GitHub repository, for both the constraint-based part and the machine learning part, allowing the manuscript's findings to be reproduced and the approach to be applied to other organisms.

| Predictions using constraint-based models
Apart from data-driven models, constraint-based models have also been employed to predict protein abundance. More specifically, they predict enzyme concentrations in pcGEMs or resource allocation models, given the relation v ≤ kcat · [E]. Goelzer et al. (2015) reconstructed an RBA model of Bacillus subtilis to study its physiological processes. They obtained predictions of fluxes, enzyme abundances, and resource costs of cellular processes. Regarding the predictions of enzyme concentrations, they obtained an R^2 value of 0.94 for growth simulations on a minimal medium supplemented with pyruvate, glucose, or a combination of glucose and glutamate as carbon sources. The framework developed by Heckmann et al. (2018) was also used for the prediction of protein abundance, given that the machine learning-predicted kapp,max values were used in the pcGEMs.
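The relation v ≤ kcat·[E] enters a pcGEM as flux upper bounds. A toy example with a three-reaction linear pathway (hypothetical kcat and abundance values; scipy's linprog as the solver):

```python
import numpy as np
from scipy.optimize import linprog

# reactions: r1 = uptake, r2 = A -> B, r3 = B -> biomass
kcat = {"r2": 2.0, "r3": 5.0}       # 1/s, hypothetical
abundance = {"r2": 3.0, "r3": 1.0}  # mmol/gDW, hypothetical

bounds = [
    (0, 10.0),                          # r1: uptake bound from the medium
    (0, kcat["r2"] * abundance["r2"]),  # r2: v <= kcat * [E] = 6
    (0, kcat["r3"] * abundance["r3"]),  # r3: v <= kcat * [E] = 5
]
# steady state for metabolites A (r1 - r2 = 0) and B (r2 - r3 = 0)
A_eq = np.array([[1, -1, 0], [0, 1, -1]])
b_eq = np.zeros(2)
c = [0, 0, -1]  # maximize the biomass flux r3

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x)  # biomass flux capped at 5 by the r3 enzyme capacity
```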
This was used to assess the improvement of using kapp,max values over kcat values, which resulted in a 43% lower prediction error for the former when using MOMENT or ME models. Similarly, the kcat values predicted using the tools DLKcat (Li, Yuan et al., 2022) and TurNuP (Kroll et al., 2023) were also used to calculate protein abundances, and both approaches were compared to the same experimental data.
For DLKcat, pcGEMs parameterized with its predicted kcat values achieved a root mean squared error 30% lower than pcGEMs parameterized with their original kcat values. For TurNuP, pcGEMs parameterized with its predicted kcat values could predict protein abundance more accurately than DLKcat in 19 out of 21 growth conditions, with on average 18% lower mean squared errors between measured and predicted protein abundances. Two approaches based on the principle of minimization of metabolic adjustment have been developed to specifically predict the adjustment of enzyme usage between growth conditions. The first, PARROT (Ferreira et al., 2023), proposes the minimization of the Manhattan (linear) or Euclidean (quadratic) distance between the enzyme usage distributions of a reference growth condition and an alternative growth condition, with or without the consideration of metabolic fluxes. PARROT is available in a GitHub repository and is implemented in MATLAB, and is therefore compatible with pcGEMs generated with the GECKO Toolbox (Sánchez et al., 2017). When compared to experimental proteomics data, PARROT achieved higher Pearson correlations than other methods, such as pFBA, which is the current standard for pcGEMs (Domenzain et al., 2022). The second approach is part of the OVERLAY tool (Yao et al., 2023), which implements an optimization problem to minimize the Euclidean distance between a transcript abundance distribution obtained from RNA-seq data and the enzyme usage distribution, under the assumption that both distributions are similar and that the predicted enzyme usage distribution maintains metabolic feasibility. This approach is thus dependent on the availability of gene expression data, whereas PARROT requires no additional data to generate predictions. The predicted enzyme usage distribution from OVERLAY closely matched the RNA-seq data, achieving an R^2 score of over 0.9. However, given that the resulting enzyme usage distribution is simply the result of a distance minimization to the RNA-seq data, this result is an artifact of fitting the RNA-seq data as part of the approach.
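The Manhattan variant can be cast as a linear program with auxiliary variables t_i ≥ |e_i − e_i^ref|. In this toy instance, two enzymes with "metabolic feasibility" reduced to a single capacity constraint Σ kcat_i·e_i ≥ v_req, and all numbers hypothetical:

```python
import numpy as np
from scipy.optimize import linprog

e_ref = np.array([1.0, 1.0])  # enzyme usage in the reference condition
kcat = np.array([1.0, 3.0])   # turnover numbers
v_req = 9.0                   # flux demand of the alternative condition

# variables x = [e1, e2, t1, t2]; minimize t1 + t2
c = [0, 0, 1, 1]
A_ub = [
    [ 1,  0, -1,  0],             # e1 - t1 <= e_ref1
    [-1,  0, -1,  0],             # -e1 - t1 <= -e_ref1
    [ 0,  1,  0, -1],             # e2 - t2 <= e_ref2
    [ 0, -1,  0, -1],             # -e2 - t2 <= -e_ref2
    [-kcat[0], -kcat[1], 0, 0],   # kcat . e >= v_req
]
b_ub = [e_ref[0], -e_ref[0], e_ref[1], -e_ref[1], -v_req]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 4)
e_new = res.x[:2]
print(np.round(e_new, 4))  # only the more efficient enzyme is adjusted upward
```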
While current pcGEMs have expanded our understanding of enzyme usage and allocation, allowing for multiomics data analysis and the design of metabolic engineering strategies, there is still considerable room for improvement, especially for use cases not assessed by the proposed methods (e.g., metabolic engineering strategies directed at improving enzyme catalytic properties, coupled with over- or underexpression of enzymes). In addition, some standing questions remain unanswered, which should guide the development of new approaches for the estimation and integration of enzyme constraints.

| FUTURE DIRECTIONS FOR ESTIMATION AND INTEGRATION OF ENZYME CONSTRAINTS: STANDING QUESTIONS AND NEW OPPORTUNITIES
6.1 | Do the predicted protein abundances match protein allocation?

Cells produce enzymes in higher quantities than necessary to sustain growth, meaning that they do not operate at their full catalytic capacity (O'Brien et al., 2016). This overabundance of enzymes could act as a buffer that allows the cell to quickly adapt to variations in nutrient uptake or other changes in the environment (Mori et al., 2017). However, this also raises the question of whether predictions of protein abundance can match the amount of protein that is actually allocated by the cell. Along these lines, models that integrate macromolecular machineries and resource allocation strategies, such as ME models and RBA models, can capture protein allocation principles, but they depend on biochemical parameters that can be difficult to obtain. Data-driven models could be used to overcome this problem, as features used to train the models, such as mRNA expression and codon usage bias, can be a proxy for in vivo enzyme concentrations. A coupling of data-driven and constraint-based models could further enhance predictions, as proposed in the CAMEL approach (Ferreira et al., 2023), where the ratio between pcGEM-predicted protein abundance values and in vivo measurements was used to train machine learning models. This development highlights the opportunities for integrating multiomics data and multimodel approaches.

| Toward protein-constrained models of microbial communities
In natural habitats, microbes seldom live isolated. Instead, they engage in complex social interactions that shape their ecosystem (Konopka, 2009). GEMs of microbial communities have also been reconstructed and analyzed using FBA and its derivations, and they provide valuable insights into how these communities are organized and how they function at the metabolic level (Ibrahim et al., 2021).
However, integration and estimation of turnover numbers and enzyme abundances have so far been limited to GEMs of single organisms, and absolute quantification of proteins in microbial communities is still an underexplored avenue. Possible challenges for reconstructing pcGEMs for microbial communities include the reliance on data from single organisms, such as the kcat values deposited in BRENDA or SABIO-RK, and the lack of data for in silico estimation of these parameters. Even if data were available, the taxonomic heterogeneity of species in the GEM can make it difficult to map turnover numbers to reactions, as kcat values can differ from one species to another, which could impact simulations.
Community models can be reconstructed either by combining all the overlapping genetic and metabolic information as if it were a single organism or by reconstructing many small individual models, each corresponding to one species, which are then treated as compartments of a bigger model with a shared extracellular compartment (Dillard et al., 2021). Therefore, the structure of the community model could also impact how enzymatic constraints are integrated. A possible workaround for integrating kcat values in community models could be the representation of reactions devised by Bulović et al. (2019) for RBA models. There, the need to specify parameters for each protein individually is bypassed by using the BiPON ontology (Henry et al., 2017), an annotation that represents cellular processes in a unified way, to represent overlapping processes as one entity. This adheres to what is done for community models when reactions shared between species are lumped into a single reaction, and it opens the opportunity for novel approaches to integrating enzymatic constraints in community models.

| Heterogeneity of cell types in higher eukaryotes
The integration of kcat values for most prokaryotes and single-celled eukaryotes is straightforward, as there are no variations in cell type that could drastically change enzyme activity when calculating kcat values from estimated fluxes (Ohno et al., 2008). In multicellular organisms, such as plants and animals, however, cells can differentiate and form tissues and organs, with each cell type finely tuned to a specific metabolism, which presents a challenge for the integration of kcat values in GEMs of multicellular organisms. One workaround to this problem is to develop a cell-agnostic model, reconstructed using all reactions from all cell types and tissues. This cell-agnostic model can then be constrained using transcriptomics or proteomics data to generate tissue-specific models, such as the 126 models of human tissues reconstructed using the mCADRE tool (Wang et al., 2012), and the 11 tissue-specific human models and the models of guard cells and mesophyll cells of Arabidopsis thaliana reconstructed using the RegrEx approach, which also provides flux distributions (Robaina-Estévez & Nikoloski, 2014; Robaina-Estévez et al., 2017). By using protein measurements from multiple tissues, one could obtain estimates of catalytic rates that correspond to a specific cell type, allowing for more accurate reconstructions of tissue-specific pcGEMs. A question that arises is whether such kcat values would depend on the tissue context, as different cell types (for example, photosynthetically active mesophyll cells versus starch-accumulating root cells) have evolved cellular phenotypes with different metabolic objectives and may thus possess different metabolic environments in which kinetic properties can differ.

| Usage of fluxes determined from labeling studies
An interesting point to consider is the usage of fluxes determined from labeling studies to estimate kcat values, rather than using fluxes determined from constraint-based approaches like FBA or pFBA.
Experimental flux estimation comes from 13C-based metabolic flux analysis (13C-MFA), which makes use of tracers labeled with 13C, such as a carbon source used for growth (Zamboni et al., 2009). As this tracer is consumed and incorporated into other metabolites, a particular labeling pattern is achieved. The metabolites that incorporate the tracer can then be detected by analytical techniques such as nuclear magnetic resonance or mass spectrometry (Fischer et al., 2004; Truong et al., 2014). In vivo estimates of internal fluxes can be derived by applying these measurements to a small-scale metabolic model and solving a nonlinear least-squares regression problem (Sokolenko et al., 2016). Using flux values estimated from 13C-MFA comes with advantages, such as more reliable and precise measurements of metabolic fluxes (Crown & Antoniewicz, 2013); with 13C-MFA data, estimations of kcat should likewise be of higher accuracy. However, this method comes with challenges, such as the resource intensiveness of obtaining the data and the flux estimates (due to the large nonlinear optimization problems solved; Sokolenko et al., 2016). Experimentally, the analytical steps require advanced training and expensive instrumentation, and the scale of estimation is still limited, often being confined to central metabolic pathways (Ohno et al., 2022). All in all, as 13C-MFA becomes more widely accessible, kcat estimations can take advantage of the more precise flux estimates, leading to even better phenotype predictions by pcGEMs.
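The fitting step can be caricatured with a one-parameter model (the response curves mapping a flux split to labeling fractions are invented for illustration, not taken from a real network):

```python
import numpy as np
from scipy.optimize import least_squares

V_TOT = 10.0  # fixed total uptake flux

def predicted_labeling(v1):
    """Hypothetical labeling fractions of two measured metabolites vs. the flux split."""
    f = v1 / V_TOT
    return np.array([0.9 * f + 0.1, 0.5 * (1 - f) + 0.2])

# synthetic "measurements": true split v1 = 6, plus small errors
observed = predicted_labeling(6.0) + np.array([0.01, -0.01])

fit = least_squares(lambda x: predicted_labeling(x[0]) - observed,
                    x0=[5.0], bounds=([0.0], [V_TOT]))
print(round(float(fit.x[0]), 2))  # close to the true split of 6
```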

| Underground metabolism and promiscuous enzymes
Many enzymes display side activities by catalyzing reactions other than their main reaction. These enzymes are known as promiscuous, and they play an important role in metabolism (Amin et al., 2019). In the cellular context, promiscuous enzymes can form an alternative metabolic network of reactions termed underground metabolism. While many of these reactions are physiologically irrelevant given their low enzymatic activity, they can act as a reservoir of novel enzyme functions that can arise in circumstances where the side reaction is favored (Rosenberg & Commichau, 2019). The evolution of such novel functions can also be exploited for biotechnological purposes in adaptive laboratory evolution (ALE) experiments (Kovács et al., 2022).
In a constraint-based modeling context, promiscuous enzymes affect the GPR rules in such a way that multiple reactions are catalyzed by the same enzyme. This makes it challenging to integrate kcat values for promiscuous enzymes, as databases might not contain information about noncanonical enzyme/substrate pairs, and the difference in substrate affinity means each reaction will have a different kcat value (Davidi et al., 2016). Experimental estimation of the catalytic rates of promiscuous enzymes also runs into many roadblocks, because the products of side reactions might be unknown, and the yield of such products is undetectably low (Waki et al., 2021). Even if data were available, many approaches have no mechanisms to deal with promiscuous enzymes. Approaches such as MOMENT and PAM assign a single kcat value to the enzyme irrespective of the substrate.
The eMOMENT approach explicitly accounts for enzyme promiscuity by adding a constraint that limits the summed usage of an enzyme across all of its associated reactions to its total abundance, but it still does not take substrate affinity into account.
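In such a formulation, the usage an enzyme devotes to each reaction j is v_j / kcat_j, and the sum across reactions must stay within the measured abundance. A minimal check (hypothetical values):

```python
def enzyme_demand(fluxes, kcats):
    """Total abundance a promiscuous enzyme needs to carry the given fluxes."""
    return sum(v / k for v, k in zip(fluxes, kcats))

def within_capacity(fluxes, kcats, abundance):
    """eMOMENT-style constraint: summed usage across reactions <= enzyme abundance."""
    return enzyme_demand(fluxes, kcats) <= abundance

# one enzyme catalyzing a main reaction (kcat = 10) and a side reaction (kcat = 0.5)
fluxes, kcats = [4.0, 0.2], [10.0, 0.5]
print(enzyme_demand(fluxes, kcats))         # 0.8
print(within_capacity(fluxes, kcats, 1.0))  # True
```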
Nevertheless, data-driven approaches such as DLKcat have shown good promise for estimating the catalytic rates of promiscuous enzymes (Li, Yuan et al., 2022). As biochemical and computational methods become more refined, the inflow of data might push for the development of novel approaches that consider promiscuous activity in pcGEMs. This could enhance the simulation of ALE experiments and further boost metabolic engineering endeavors.

| CONCLUSION
From the 191 catalytic rates estimated by Valgepea et al. (2013) to the more than 300,000 catalytic rates estimated by Li, Yuan et al. (2022), there have been great strides in the parametrization of pcGEMs, with many methods for estimating enzyme kinetics and enzyme abundance being developed and achieving good agreement with experimental measurements. The many approaches developed to integrate catalytic rates and enzyme usage in GEMs have contributed significantly to understanding complex phenotypes and to assisting metabolic engineering endeavors. They also provide an opportunity for integrating multiple omics data sets. Some challenges still remain, though, which should provide a fulcrum for future research.

ACKNOWLEDGMENTS
This study was supported in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior-Brasil (CAPES)-Finance

FERREIRA
are represented in the model, and metabolic fluxes are predicted by including substrate-enzyme binding and product-enzyme dissociation reactions. They may also include information on protein translocation, compartmentalization, folding, and thermostability. It includes documentation and a step-by-step tutorial to reproduce the pcGEM generated in its manuscript, and it provides an automated tool to reconstruct pcGEMs, with model complexity comparable to sMOMENT and ECMpy. After enzyme constraints are integrated into GEMs, giving rise to pcGEMs, a plethora of previously unattainable phenotypes can be simulated. Simulations of metabolic switches, like overflow metabolism, are often missed by conventional GEMs. Overflow metabolism denotes the phenomenon where metabolic flux is totally or partially redirected from respiratory pathways to fermentation pathways, despite the availability of oxygen (de Alteriis et al., 2018).
that these estimations could be used in pcGEMs. Proteomics measurements were used for the enzyme abundances [E(C)], while the fluxes v_j(C) were determined by parsimonious FBA (pFBA), which minimizes the total flux through the network and constrains the model by growth rate and culture medium composition. By assessing proteomics experiments performed in 31 different growth conditions, they estimated condition-specific catalytic rates and took the maximum value as the maximum kapp, or k_max^vivo. The usage of predicted fluxes, instead of fluxes determined by metabolic flux analysis (Valgepea et al., 2011), allowed the estimation of catalytic rates for a larger number of enzymes than previously obtained by Valgepea et al. (2013). These estimated values were in good agreement with in vitro kcat values, which demonstrates the precision of the method in predicting catalytic rates. Next, Heckmann et al. (2018) used machine learning to predict the catalytic rates kcat and kapp,max using a feature set composed of structural, biochemical, and network data, such as molecular weight, structural disorder, active site structure and function, enzyme commission (EC) number, metabolic flux, Km, pH, and temperature.
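The estimation rule, kapp(C) = v(C)/[E(C)] per condition with k_max^vivo taken as the maximum across conditions, reduces to a few lines (the fluxes and abundances below are hypothetical):

```python
def k_app(flux, abundance):
    """Apparent catalytic rate in one condition: k_app = v / [E]."""
    return flux / abundance

def k_max_vivo(conditions):
    """Maximum apparent catalytic rate across all measured conditions."""
    return max(k_app(v, e) for v, e in conditions.values())

# (flux, enzyme abundance) pairs for one enzyme across growth conditions
conditions = {
    "glucose":   (12.0, 2.0),  # k_app = 6.0
    "galactose": ( 4.0, 1.0),  # k_app = 4.0
    "ethanol":   ( 9.0, 1.2),  # k_app = 7.5
}
print(k_max_vivo(conditions))  # 7.5
```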

Five different regression models were trained and assessed: linear regression, partial least squares, elastic net, random forest, and a deep neural network. Based on the coefficient of determination (R^2) as a goodness-of-fit measure, the best-performing model was the random forest, which achieved median R^2 scores of over 0.75 for both the training and test data sets. An analysis of feature importance revealed metabolic flux to be the most important feature for the prediction of both catalytic rates. They found that models using kapp,max values had better predictive capability than models using kcat values, based on the root mean squared error (RMSE) of predicted enzyme usage values. A follow-up study used proteomics and fluxomics data to estimate the catalytic rates instead of using pFBA-predicted metabolic fluxes, achieving more precise estimations (Heckmann et al., 2020). A further development using constraint-based approaches was made by Xu et al. (2021). It rests on the observation that in both Davidi et al. (2016) and Heckmann et al. (2018) there are many expressed enzymes that carry no flux, called idle enzymes (Xu et al., 2021). For that reason, Xu et al. (2021) formulated a two-step mixed-integer linear program, termed NIDLE-flux, that maximizes the number of enzymes that carry flux. With this approach, it is then possible to increase the number of catalytic rate values that can be estimated. The resulting metabolic flux is then used together with protein abundance data to estimate k_max^vivo, as in the approach of Davidi et al. (2016). This led to a 1.4-fold increase in the number of estimated k_max^vivo values compared to the estimations by Davidi et al. (2016) and Heckmann et al. (2020).
has inspired the development of other data-based models using different algorithms and training data to improve predictions. In this regard, Yu et al. (2023) go in a different direction compared to DLKcat and propose the usage of a pretrained language model to predict kcat values instead of a convolutional neural network. This approach, termed PreKcat, depends on amino acid sequences and the molecular structure of substrates and addresses two additional problems: predicting Km values and predicting kcat/Km ratios (i.e., catalytic efficiency). PreKcat achieved a Pearson correlation of 0.83 on the test data set, which contained enzymes not seen in the training data set. Similar to DLKcat, PreKcat was developed using PyTorch and is available on a GitHub repository. The tool requires several pretrained language models to be preinstalled, and the documentation is still under construction as of writing. Despite the improvements of PreKcat, the sequence similarity between enzymes in the training data set and test data set can still bias the predictions, even if the enzymes themselves are different.
The described constraint-based approaches relied on predicting protein abundances by directly considering the relation between kcat values, protein abundances, and metabolic fluxes that reflect a given physiological state. Changes in growth conditions are often accompanied by changes in the allocation of proteins, to facilitate the establishment of a new homeostasis. Two approaches based on the principle of minimization of metabolic adjustment, PARROT and OVERLAY, have been developed to specifically predict the adjustment of enzyme usage.
Enzyme abundances computed from the relation between fluxes and catalytic rates correspond to the optimal abundance of enzymes necessary to carry the predicted flux with the given catalytic rate. This means that these calculations underestimate in vivo enzyme concentrations. As a result, even the integration of protein abundance values in pcGEMs poses a challenge, since not the entire enzyme pool is used to carry flux, rendering enzymes unsaturated. Thus, a range of enzyme usage values could allow the pcGEM to provide similar predictions, as metabolite levels would determine the alteration of metabolic fluxes. This raises the question of how important absolute protein measurements actually are, since in scenarios in which metabolic fluxes reach their respective thresholds, increasing protein abundance would have a negligible effect on flux variability. Addressing this question would require directing research efforts toward obtaining proteomics data from scenarios in which enzyme usage thresholds are met.

, the S. cerevisiae model ecYeast8 can also predict the order of consumption of carbon sources with a good correlation with experimental data. The dFBA method introduces kinetic equations for extracellular metabolites and biomass, allowing the dynamic simulation of metabolism (Moreno-Paz et al., 2022). When growth is simulated using a combination of glucose and sucrose as carbon sources, ecYeast8 first predicts the consumption of glucose as the initial carbon source. When glucose is depleted, the hydrolysis of sucrose occurs, with the subsequent consumption of its glucose moiety as a carbon source. In this scenario, fructose is left unused until glucose is depleted, and is then used to support a third phase of growth (Moreno-Paz et al., 2022). Considering the importance of kcat values for simulating phenotypes such as diauxic growth and overflow metabolism, and the challenges with in vitro kcat measurements, there is a growing need for alternatives to experimental measurements, as we describe in the next section.

4 | APPROACHES FOR ESTIMATION OF kcat VALUES

Four different computational approaches have been proposed to estimate in vivo kcat values. These approaches rely on the relationship between fluxes, enzyme concentrations, and kcat values, or on data-driven models trained on enzyme biochemistry data and features derived from biological sequences.