Application of dynamic metabolic flux analysis for process modeling: Robust flux estimation with regularization, confidence bounds, and selection of elementary modes

Abstract In macroscopic dynamic models of fermentation processes, elementary modes (EM) derived from metabolic networks are often used to describe the reaction stoichiometry in a simplified manner and to build predictive models by parameterizing kinetic rate equations for the EM. In this procedure, the selection of a set of EM is a key step which is followed by an estimation of their reaction rates and of the associated confidence bounds. In this paper, we present a method for the computation of reaction rates of cellular reactions and EM as well as an algorithm for the selection of EM for process modeling. The method is based on the dynamic metabolic flux analysis (DMFA) proposed by Leighty and Antoniewicz (2011, Metab Eng, 13(6), 745–755) with additional constraints, regularization and analysis of uncertainty. Instead of using estimated uptake or secretion rates, concentration measurements are used directly to avoid an amplification of measurement errors by numerical differentiation. It is shown that the regularized DMFA for EM method is significantly more robust against measurement noise than methods using estimated rates. The confidence intervals for the estimated reaction rates are obtained by bootstrapping. For the selection of a set of EM for a given st oichiometric model, the DMFA for EM method is combined with a multiobjective genetic algorithm. The method is applied to real data from a CHO fed‐batch process. From measurements of six fed‐batch experiments, 10 EM were identified as the smallest subset of EM based upon which the data can be described sufficiently accurately by a dynamic model. The estimated EM reaction rates and their confidence intervals at different process conditions provide useful information for the kinetic modeling and subsequent process optimization.


| INTRODUCTION
For model-based optimization of fermentation processes, for example, for process design or control, simple dynamic models which are accurate enough to predict the process behavior under varying conditions are needed (Frahm et al., 2002;Neddermeyer, Rossner, & King, 2015;Teixeira, Alves, Alves, Carrondo, & Oliveira, 2007).
Essential elements of models of fermentation processes are the stoichiometry of the biochemical conversion and the dependency of the reaction rates on the process conditions. The metabolism of the cells is very complex and comprises hundreds of chemical reactions, so that it is infeasible to derive rate equations for all these reactions. For the derivation of efficient models-efficient meaning sufficiently accurate with predictive capabilities but not overly complex-the usage of small metabolic networks at steady state (Nolan & Lee, 2011) or selections of elementary modes (EM) as macro reactions (Gao, Gorenflo, Scharer, & Budman, 2007;Provost, 2006;Soons, Ferreira, & Rocha, 2011;Teixeira et al., 2007) have been shown to be a powerful approaches. EM are calculated from a metabolic network and therefore provide a physiologically meaningful abstraction of the metabolism without the need of including dynamic intracellular mass balances and reaction kinetics. Formal kinetics or black-box models like multilayer perceptron networks (MLP) can then be used to model the dependency of the reaction rates of the EM on the process conditions or on the concentrations of species in the reactor.
Alternative modeling approaches use empirical qualitative reaction schemes as macro reactions and fit the corresponding stoichiometric coefficients to data (Herold & King, 2014;Mailier & Wouwer, 2009). The complexity of these models is comparable with the complexity of models which use EM as macro reactions, as internal balances and reactions are lumped onto a few macroscopic pathways.
However, physiological constraints as, for example, balances of internal components, energy carriers, or redox-species cannot be taken into account and available biological knowledge is neglected.
For the generation of models that are based on EM it is necessary to (a) select a set of EM from a usually large number of possible EM and (b) select and fit kinetics for each of the reactions in the set.
Previously published methods for the selection of EM use estimated uptake or secretion rates. Soons et al. (2010) use a controlled random search algorithm to find the best set of EM which minimizes the difference to the estimated rates, and Abbate et al. (2019) link the number of reactions to the fraction of the explained variance in the estimated rates by comparing eigenvalues of a SVD decomposition. The subset is then found by solving a linear optimization problem.
In both approaches, the trade-off between the size of the set of EM and the accuracy of the representation of the estimated rates is exploited.
For the selection and fitting of kinetics, estimates of the cellspecific EM reaction rates are needed. Several algorithms that have been proposed for this analysis use measured cell-specific fluxes of medium components (Poolman, Venkatesh, Pidcock, & Fell, 2004;Schwartz & Kanehisa, 2006).
A disadvantage of these algorithms for the selection of the EM and the estimation of their reaction rates is that the estimation of the cell-specific uptake or secretion rates has to be carried out first. This implies that derivatives of concentrations have to be computed in the first step, and, as the data is usually significantly corrupted by measurement errors, the resulting rates show large fluctuations as the derivation amplifies the errors.
In this paper, we present methods for the analysis and selection of EM for process modeling where measurement data is used directly: A method for the analysis of EM reaction rates is presented which is based on the approach for dynamic metabolic flux analysis (DMFA) by Leighty and Antoniewicz (2011) for the computation of internal flux distributions of a metabolic network. This method was not developed for the analysis of EM, but, as will be shown, it can be used for this purpose as well. The advantage of the approach by Leighty and Antoniewicz (2011) is that random noise in the measurements is attenuated by solving a linear optimization problem such that smoothing and numerical differentiation is not necessary. The method from Leighty and Antoniewicz (2011) is extended in this paper by the following elements: • Additional linear constraints are included so that the fluxes are estimated taking into account the irreversibility of certain reactions.
• For large sets of reactions, the objective which was proposed in the approach of Schwartz and Kanehisa (2005) is considered in the objective of the DMFA method as a regularization term.
• The propagation of the measurement errors to the computed rates is obtained by bootstrapping.
It is shown that the regularized DMFA for EM method is more robust against measurement noise than other methods which are based upon the estimation of cell specific fluxes and that tuning of the algorithm is straightforward by using cross-validation.
The selection of a subset of EM which are suitable for dynamic process modeling is determined by the following procedure: (1) Before the EM are analyzed, the original DMFA problem with additional constraints to account for the irreversibility of certain reactions provides the best possible fit for a given metabolic network and therefore can be used as a benchmark. A suitable selection of EM should exhibit a comparable fit to the data. If the quality-of-fit is sufficient, the selection of EM can be carried out, otherwise essential reactions in the metabolic network are missing.
(2) A first reduction is carried out by means of geometrical reduction in which geometrically similar EM are discarded from the possibly large set of possible EM.
(3) For a further reduction, a multiobjective genetic algorithm (GA) is used together with the DMFA for EM method. The two objectives of the multiobjective GA are to optimize the fit of the predictions to the measured data and to minimize the size of the employed subset of the EM To explore the trade-off between a good fit to the data and a low number of EM which is preferable because it reduces the model complexity.
After a set of EM has been found, the pdf of the corresponding EM reaction rates over time at different process conditions can be HEBING ET AL. | 2059 evaluated by a bootstrap method in which the estimation is repeated with resampled measurements. This information is useful for selecting and fitting kinetic expressions of the reactions by statistical methods. A quantification of the uncertainty of the estimated cell-specific rates is necessary as the magnitude of this uncertainty varies considerably during the course of a fermentation process. Figure 1 gives a graphical overview about the different steps of our modeling procedure.
Some elements of this procedure were already published presented in (Hebing, Neymann, Thüte, Jockwer, & Engell, 2016). This contribution extends this study by (a) adding regularization to the DMFA for EM problem, (b) adding linear constraints to the original DMFA approach, (c) showing how the bounded DMFA method can be used for the evaluation of a metabolic network, and (d) calculating confidence intervals of the specific reaction rates.
The selection and analysis of EM from a small metabolic network is demonstrated in Section 4 with real data from a CHO cultivation fed-batch process under varying conditions based on an extended metabolic network from Nolan and Lee (2011). The selection and fitting of the kinetic equations based on these estimates is only sketched, as this is beyond the scope of this contribution.

| THEORY
In this section, a short introduction of the original DMFA method by Leighty and Antoniewicz (2011) (which we will call unbounded DMFA) and the related new formulations (bounded DMFA and DMFA for EM) is given. Furthermore, the evaluation of the confidence intervals of the DMFA estimates and the regularization are explained.

| Dynamic metabolic flux analysis
The differential equations that govern the evolution of the concentrations of the species in the reaction medium in a batch process (i.e., without addition or removal of substances from the reaction volume) can be calculated from the time-dependent vector of fluxes in the metabolic network, t ν ( ): where P is the (external) stoichiometric matrix, t ν ( ) the cell-specific flux vector, and X t v ( ) is the concentration of viable cells. The volumetric flux vector t ¯( ) is: This results in a, usually under-determined, system of linear equations: where N is the (internal) stoichiometric matrix. Under this assumption, t _ ( ) can be calculated from volumetric rates of the so-called free fluxes U t ( ): where null(N) denotes that N K 0 ⋅ = . All fluxes which satisfy Equation (4) fulfill the steady state assumption for the internal metabolites. Leighty and Antoniewicz (2011) assume that the volumetric free fluxes U t ( ) are piece-wise linear over time between inflection points T j , F I G U R E 1 Overview of the steps for the selection and analysis of EM for process modeling. After a metabolic network has been chosen, the possible quality-of-fit can be tested using the bounded DMFA method. If this fit is sufficient, a selection of EM from this network can be performed using the geometrical reduction and the multiobjective genetic algorithm.  In the bounded DMFA problem, constraints for certain internal fluxes that are non-negative due to the irreversibility of the corresponding reactions _ irr are taken into account: All fluxes which are calculated from Equations (4) and (127) fulfill the steady state assumption and the irreversibility constraints.
The (volumetric) fluxes t ¯( ) which fulfill the steady state assumption and irreversibility constraints can also be expressed as a non-negative linear combination of reaction rates of EM, assembled in the vector R t ( ) (Schuster & Hilgetag, 1994): where E is the elementary mode matrix. The profile of the (volumetric) EM reaction rates over time R t ( ) is obtained by solving the DMFA for EM problem (9): can be obtained from the volumetric rates R t ( ) by:

| Regularization: Minimizing the norm of the specific reaction rates
For large sets of EM, the solution of the estimation problem (9) may lead to ill-conditioned problems, so that even large changes in R T k j ( ) result in only small changes of the value of the objective function, as many EM with a similar stoichiometry exits. To overcome this problem, a penalty term is added to the cost function: where X v is the concentration of viable cells and α is a scalar weighting factor which optimal value is found by cross-validation.
With this additional term, the L2-norm of the cell-specific rates is penalized. Penalizing cell-specific rates is preferred over penalizing volumetric rates as unwanted and unrealistic behavior on the cellular level is less likely to occur. Otherwise, unrealistically high cell-specific rates might be obtained from the DMFA for EM method in the beginning or at the end of a process where the concentration of viable cells is low. This criterion is also used by Schwartz and Kanehisa (2005) for the estimation of EM reaction rates.

| Confidence intervals by resampling of measurements
Classical methods for the estimation of confidence intervals like the Cramer-Rao lower bound are problematic in systems with many parameters from which a few might be unobservable, which can happen in the DMFA method when a large number of reactions are involved. Additionally, constraints on the reaction rates cannot be taken into account.
We therefore propose to use a bootstrap method instead. Here, the measurements are resampled from their expected probability density functions. Often, the probability density functions of the measurements can be assumed to be Gaussian distributed and the variances can be estimated or are available from sensor data. Alternatively, these uncertainties can be estimated by a maximum likelihood estimator.
The estimation of the reaction rates with the DMFA method can then be repeated using the sampled data. However, two major changes have to be made to get rid of the estimation bias by the DMFA method: • The position of the inflection time-points T j must be randomized.
• The regularization coefficient α must be zero or small such that numerically ill-conditioned estimations can be carried out without strongly influencing the result.
From the sampled estimates of the reaction rates and their confidence intervals are obtained.  (7) and (9) therefore are equal if the complete set of EM is used. When some EM are removed from the complete set, the corresponding columns in E and elements of R (Equation (8)) are deleted and the calculated SSR will increase. So the SSR value from the solution of the bounded DMFA provides a lower bound which is only dependent on the assumed metabolic network. Choosing a subset of EM will lead to a higher SSR value, that is, a worse fit to the measurement data. If the results from the bounded DMFA are not sufficiently accurate, a modification of the metabolic network should be considered before a subset of EM is chosen.
For the selection of a subset for a process model, two algorithms are proposed: (1) A geometrical reduction, which discards similar EM from the original set based on their cosine similarity. This algorithm can be used for a preliminary reduction if the initial number of EM is very large. The SSR value which is calculated using the DMFA for EM method can be used to ensure that no significant EM is removed. The algorithm is described in the appendix.
(2) A multiobjective GA which considers as objectives the SSR value and the number of EM in the set.  (9) is solved using a linear solver to obtain the corresponding SSR value. The optimization problem can be written formally as: The resulting Pareto front describes the SSR which results from an optimized selection of EM as a function of the cardinality of the set of EM. The Pareto front helps the modeler to decide how many EM are necessary to capture the measured behavior of the process with an acceptable model error. The SSR value approaches the SSR value of the bounded DMFA problem when the number of EM in the subset increases.
Thus, before any kinetic expression is fitted, the SSR value of the resulting sets of EM give an indication of the expected quality-of-fit and of the complexity of the process model which is based on this set of reactions. Practically, one will search for the lowest number of EM which still provide a desired fit to the data.

| SIMULATION STUDY: EM ANALYSIS USING NOISY CONCENTRATION DATA
In this simulation study, the capability of predicting reaction rates of EM from noisy concentration measurements is tested. We compare the DMFA for EM method that was presented above with other commonly used techniques that are based on the analysis of cellspecific uptake or secretion rates, namely Poolman et al. (2004) and Soons et al. (2011). As the estimation method of these cell-specific rates from time-series of noisy concentration measurements plays a significant role in the analysis, several combinations for smoothing, interpolation, and filtering are also tested.
A small metabolic network with known kinetic terms was used to generate artificial measurement data of a fed-batch process at different levels of measurement noise. The complete model, the chosen boundary conditions and the estimation methods are described in the appendix. Figure 2 shows all estimated reaction rates together with the real reaction rates of the EM.
It can be seen that either smoothing or filtering is necessary to obtain good estimates when the Poolman or the Soons methods are used. The choice of the interpolation and filtering method which is used for calculating the specific rates has a significantly higher impact on the result than the estimation method itself.  Note: The calculation of specific rates is carried out using different interpolation (splines, smoothing splines)-and filtering methods (MA = moving average filter). No smoothing and filtering is employed for the DMFA for EM method. More details are described in the appendix.
T A B L E 2 Reactions of the reduced metabolic network from Nolan and Lee (2011)

| REAL WORLD EXAMPLE: EM SELECTION AND ANALYSIS USING EXPERIMENTAL CHO FED-BATCH FERMENTATION DATA
The concept for the choice of an EM reaction set was applied to datasets from six fed-batch fermentations of a CHO culture. One experiment was excluded from this procedure and used as a validation data set for the model which was built using the chosen set of EM and fitted to the other five data sets. For brevity, we will show only three of the remaining five experiments in the following which represent the most "extreme" responses of the process to different set-points for the pH value and glucose levels. Measurements of the viable cell density (X v ), the total cell density, concentrations of the components of the medium, and dissolved oxygen-sensor information were available and used. For confidentiality reasons, not more details can be given and not all measured components can be shown in the following.

| Metabolic network
For the calculation of pseudo-batch data from the fed-batch measurements, the influence of the liquid feed as well as the masstransfer over the gas-liquid phase boundary were compensated according to the shifting method which is described in the appendix.
F I G U R E 3 Measured data of different medium concentrations of one data set with the profiles of the estimated concentrations ĉ which were generated using the bounded DMFA (in blue, left column) and unbounded DMFA (in red, right column) methods. The color shadings refer to the 95% and the 68% confidence bounds and the median of all estimations after 500 resamples of the measurements (cf. Section 2.3). The estimated evolutions of the concentrations do not differ much, but from the evolution of mAb and Amm it can be seen that flux bounds bounds are useful for the reduction of the confidence bounds as the uptake of these substances is ruled out in the bounded DMFA. DMFA, dynamic metabolic flux analysis [Color figure can be viewed at wileyonlinelibrary.com] The metabolic network from which the EM were calculated was taken from Nolan and Lee (2011) and extended by the glyoxylatecycle and a maintenance reaction in which ATP is consumed. The network reactions are listed in Table 2.
Before the EM analysis and the choice of an EM subset, it was determined how well the network can explain the evolutions of the concentrations in the measured data. With the methods unbounded DMFA (5) and bounded DMFA (7), the fluxes of the reactions in the F I G U R E 4 Estimated reaction rates of eight chosen reactions from the metabolic network obtained using the bounded DMFA (in blue) and unbounded DMFA (in red) methods for the data of one CHO fermentation fed-batch experiment. The color shadings refer to the 95% and the 68% confidence bounds and the median of all estimations after 500 resamples of the measurements (cf. Section 2.3). Although the estimated concentrations ĉ are similar in the unbounded and bounded estimations, the differences in the estimated rates are quite large for some intracellular reactions. The major reason for this is the reversibility of the death rate in the unbounded DMFA method which also leads to lower reaction rates in anaplerotic reactions in the TCA cycle. The low concentration of the biomass X v in the beginning of the process leads to larger confidence bounds of the rates at these time-points. DMFA, dynamic metabolic flux analysis [Color figure can be viewed at wileyonlinelibrary.com] network and their confidence intervals were estimated by minimizing the difference between the estimated concentration profile and the (resampled) measured data (cf. Section 2.3). Figure 3 shows the fit to the data of one experiment for four out of 10 different concentration profiles. Figure 4 shows the estimated reaction rates from eight chosen reactions of the metabolic network. It can be seen that the irreversibility of some reactions is violated when the unbounded DMFA method is used which leads to unrealistic results.
The quality-of-fit of the bounded DMFA method determines how well the measured concentration profiles can be expressed by any selection of EM of the network. As this quality-of-fit is satisfactory, the EM analysis and the selection of a subset of EM were carried out.

| Identifying the set of active EM
Using the metatool-software (Pfeiffer, Nu, Montero, & Schuster, 1999 From the Pareto front of the multiobjective evolutionary algorithm, it is easily possible to select a subset of EM for a process  model. The set should be as small as possible but with an acceptable SSR value (i.e., an acceptable fit to the data). The fit to the data of one experiment for four out of ten different concentration profiles is shown for different selections of EM subsets in Figure 6. As expected from the Pareto front (cf. Figure 5), the quality of fit of subsets with 10 reactions and more is comparable to the result obtained by the bounded DMFA method. For <10 reactions, the approximation becomes significantly worse. With <10 EM, essential pathways seem to be missing and the data cannot be reconstructed well enough. The stoichiometry of the 10 EM is shown in Table 3. The actual state of the metabolism can sufficiently be described by a combination of these metabolic modes. The following metabolic features are present in the set of EM: • Oxic-and anoxic conversion of glucose with-and without lactate formation • Different yields for biomass and/or product formation • Utilization of different nitrogen sources • Reutilization of by-products like lactate.

| EM reaction rates and confidence intervals
After the set of 10 EM has been found, the confidence intervals for the EM reaction rates are estimated using the bootstrap method (cf. The estimated EM reaction rates and their confidence intervalsare shown in Figure 8. It can be seen that the magnitude of the estimation uncertainty varies significantly over time. Especially in the beginning and at the end of the process, the confidence bounds of the reaction rates are very large. This is due to the low concentration of viable cells. As a consequence, no significant differences in the reaction rates can be observed. Only in the middle of the process, some reaction rates are clearly different due to the different process conditions. In this example, the reaction rates of EM 1 , EM 5 , and EM 9 are not significantly influenced by the different process conditions but the other EM show observable differences.

| Further modeling steps
After the EM have been chosen and analyzed, kinetic equations must be selected for each reaction which should adequately represent the influence of the process conditions on the reactions. The selection and fitting of kinetics is not in the scope of this paper so these steps are only sketched.
The information about reaction rates and their confidence intervals is very useful for the modeling of the process, as it enables the modeler to differentiate between significant and nonsignificant influences on the reaction rates. The kinetic functions for the reaction rates r f c pH T _, , , = ( …) should express only the significant influences.
The results of the real-world example showed that the evaluation of the confidence intervals is especially useful for fitting specific rate equations to estimates as the observability of these rates heavily depends on the state of the process. In this example, the range of the confidence intervals comprises several orders of magnitude.
In an earlier contribution (Hebing et al., 2016;Neymann, Hebing, & Engell, 2019) we proposed to select and fit nonlinear reaction kinetics r (Θ) to estimated rates of EM r by solving: here, Θ is the parameter vector and t i σ ( ) is the standard deviation of the estimate which can be obtained from the bootstrap samples of the estimate r t ( ). Only with a reliable estimate of σ , it is possible to fit and compare meaningful kinetics based on a statistical measure as, for example, the Akaike information criterion.
Also black-box models like MLP for reaction kinetics of the selected EM can be fitted to the estimated reaction rates r t ( ), for example, by back-propagation. Here also, the standard deviation of the rates t σ ( ) can be used for the weighting of the values at different time-points. If this information is neglected, the identified rate expressions would be corrupted by unreliable estimation noise, especially in beginning and at the end of the fed-batch process where the observability of the specific reaction rates is bad due to a low viable cell density.
With the selected EM and fitted kinetics, the dynamic process model for fermentation processes is ready to be used for process design, optimization, model-predictive control, or other purposes.

| CONCLUSION
This paper presents methods for the selection of small sets of EMs and for the estimation of reaction rates from noisy concentration measurements. We propose extensions to the method of Leighty and Antoniewicz (2011)  It could be shown in a simulation study that the DMFA for EM method is superior for estimating EM reaction rates from noisy concentrations measurements.
The presented algorithm for the selection of small subsets of EM was used to select a suitable set of EM for a model of a CHO fedbatch fermentation using real-world data. It could be shown that the metabolism can be described by 10 EM with an acceptable accuracy at the experimental conditions.
Using a bootstrap resampling method, the confidence intervals of the EM reaction rates of this set were calculated. This information can then be used for the choice and fitting of kinetic equations.

ACKNOWLEDGMENTS
The experimental data was provided by the Upstream Develop- Given initial conditions and a feeding profile V t Feeḋ ( ), the profile of the concentrations A t ( ), C t ( ), and X t v ( ) as well as the reaction rate vector t ν ( ) can be simulated using an ODE solver. In this case study, the implicit Matlab® solver ode15s was used.
The network can be decomposed into EM. The internal and external stoichiometric matrices N and P are: The conversion factor f 0.1 x = (cf. Equation (17d)) is included in the P matrix. The EM matrix E was calculated with metatool (Pfeiffer et al., 1999) from N (with reaction 1 and reaction 2 being irreversible): Using Equations (1) and (8), the set of differential equations describing the concentrations of components in the medium is: The matrix P E ⋅ contains the stoichiometry of all EM wrt. the external concentrations. All columns are scaled, such that the norm equals one. The death rate is added as a fourth column. Equation (21) is then extended to: To counteract the effect of measurement errors, the interpolation of concentration measurements is usually smoothed. We compare three different interpolation settings with splines, using the MATLAB® function spaps with different tolerance settings (0, c 15 i ⋅¯, c 50 i ⋅¯). The resulting rates q t i ( ) are then filtered with a moving average filter (MA) using either 30 or 90 values (where each value accounts for 1 hr of process time).
Estimation of EM reaction rates using specific uptake-and secretion rates The matrices P and E are built from the metabolic network (cf. appendix: generation of artificial fed-batch data) and # denotes the Moore-Penrose pseudo-inverse.

DMFA for EM
Before the data was analyzed, pseudo-batch data was calculated from the fed-batch data by eliminating feeding influences.
The tuning of the DMFA for EM method is carried out by using cross-validation. Here, the estimation problem (11) was solved with a different number of inflection time-points and regularization coefficients α. A few randomly selected blocks of measurements were excluded from the estimation. The mean SSR value of this test-sets was then analyzed and used as a measure of an appropriate selection of the number of inflection time-points and α. It was found that 5 inflection points with a regularization parameter of 10 α ≈ are the optimal choice in this case study.

Comparison of methods
To evaluate the effect of random measurement noise, it is assumed that the measurements are subject to Gaussian noise with zero mean and a standard deviation which is dependent on the magnitude of the measurements (such that the width of the 95% confidence interval is, e.g., 5% of the magnitude of the measurements). The sampling and the estimations were repeated 100 times at different levels of noise.