pyFOOMB: Python framework for object oriented modeling of bioprocesses

Abstract Quantitative characterization of biotechnological production processes requires the determination of different key performance indicators (KPIs) such as titer, rate and yield. Classically, these KPIs can be derived by combining black‐box bioprocess modeling with non‐linear regression for model parameter estimation. The presented pyFOOMB package enables a guided and flexible implementation of bioprocess models in the form of ordinary differential equation systems (ODEs). By building on Python as powerful and multi‐purpose programing language, ODEs can be formulated in an object‐oriented manner, which facilitates their modular design, reusability, and extensibility. Once the model is implemented, seamless integration and analysis of the experimental data is supported by various Python packages that are already available. In particular, for the iterative workflow of experimental data generation and subsequent model parameter estimation we employed the concept of replicate model instances, which are linked by common sets of parameters with global or local properties. For the description of multi‐stage processes, discontinuities in the right‐hand sides of the differential equations are supported via event handling using the freely available assimulo package. Optimization problems can be solved by making use of a parallelized version of the generalized island approach provided by the pygmo package. Furthermore, pyFOOMB in combination with Jupyter notebooks also supports education in bioprocess engineering and the applied learning of Python as scientific programing language. Finally, the applicability and strengths of pyFOOMB will be demonstrated by a comprehensive collection of notebook examples.


INTRODUCTION
Biotechnological production processes leverage the microorganisms' synthesis capacity to produce complex molecules that are hardly accessible by traditional chemical synthesis. Importantly, modern genetic engineering methods allow for targeted modification of single enzymes and whole metabolic pathways for biochemically accessing value-added compounds beyond those naturally available. However, to render the production of a target compound economically feasible, a suitable bioprocess needs to be developed which fits to an engineered microbial producer strain. In this context, computational modeling approaches utilize existing knowledge on strain and process dynamics, giving rise to modern systems biotechnology. Once a digital representation of a biotechnological system has been implemented, in silico optimizations can be performed to design an improved bioprocess, effectively reducing the number of wet-lab experiments. With the availability of new experimental data the computational model can be refined to increase its predictive power towards an optimal bioprocess. Considering the highly interdisciplinary nature of systems biotechnology requiring expertise in (micro-)biology, process engineering, computer science, and mathematics, it becomes obvious that rarely a single person can have a deep knowledge in all these fields. The more specialized and performant a bioprocess model is intended to be, the higher the knowledge level needed by the user. This may prevent non-experts in modeling and programing from dealing with these highly rewarding topics. Consequently, there is a need for tools that can be quickly learned and applied by non-experts, with the development of additional skills determined by demand.
Here, we present the pyFOOMB package that enables the implementation of bioprocess models as systems of ordinary differential equations (ODEs) via the multipurpose programing language Python. Based on the objectoriented paradigm, pyFOOMB provides a variety of classes for the rapid and flexible formulation, validation and application of ODE-based bioprocess models. Table 1 gives a comparative, non-exhaustive overview of software packages that are suitable for bioprocess modeling. These tools were developed with partly other application areas in mind, e.g., modeling and analysis of biochemical networks or simulation of chemical engineering unit operations. Consequently, these software packages require different levels of programing skills and some domain-specific knowledge for accessibility. Therefore, a major driver to establish pyFOOMB was to provide a flexible modeling tool that requires only basic programing knowledge and thus shows low hurdles for beginners in bioprocess modeling. The latter is supported by a comprehensive collec-

PRACTICAL APPLICATION
Based on the powerful, yet beginner-friendly Python programing language, the pyFOOMB package addresses a wide range of users to implement bioprocess models with growing complexity. ODE models can be formulated in an objectoriented manner, which facilitates their modular design, reusability and extensibility. pyFOOMB supports the modeling of discrete behaviors in process quantities, which is an important feature for the simulation and optimization of fed-batch processes. The concept of model replicates and definition of local and global parameters mirrors the iterative nature of data generation from cycles of experiment design, execution, and evaluation. Moreover, seamless integration with existing and future Python packages for scientific computing is greatly facilitated. Most importantly, the applicability and strengths of pyFOOMB is demonstrated by a comprehensive collection of notebook examples. tion of ready-to-use working examples which come along with pyFOOMB.
Due to the full programatic access to Python, complex models can also be implemented. Furthermore, great importance was given to convenient visualization methods that facilitate the understanding of qualitative and quantitative model behavior. Finally, the enormous popularity of Python as the de facto standard language for data science applications makes it easy to integrate pyFOOMB with other advanced tools for scientific computing.

MAIN FUNCTIONALITIES OF pyFOOMB FOR BIOPROCESS MODELING
Bioprocess models are implemented as ODEs for the timedependent variables ( ): which depend on model parameters and initial values 0 . In practice, some of the variables might not be directly measurable. Therefore, observation (or calibration) functions ( ) can be defined that relate these variables to the observable measurements, thus introducing additional parameters into the model. In order to make the user familiar with our pyFOOMB tool, a continuously growing collection of Jupyter notebook examples is provided. These demonstrate basic functionalities and design principles of pyFOOMB and serve as blueprint for the rapid set up of case-specific bioprocess models (Table A1).

MODELING WORKFLOW WHEN USING pyFOOMB
In the following we present a typical workflow for implementing and applying bioprocess models with pyFOOMB ( Figure 1). Throughout this section the toy model of

Model definition
In a first step, the targeted model and its parametrization is implemented by creating a user-specific subclass of the provided class BioprocessModel ( Figure 2B). This basic class provides all necessary methods and properties to run simulations for the implemented model. Essentially, the abstract method rhs() must be formulated by the user. Noteworthy, the pyFOOMB package does not allow for consistency checking of units for the state variables or model parameters. This responsibility is left to the user while formulating a model, i.e., before coding the model as BioprocessModel subclass.

Discrete behavior
To monitor and control the dynamics of specific model variables so-called state_events() and change_ states() methods can be defined. This is for example required for the modeling of multi-phased processes such as fed-batch with event-based changes in feeding regimes.

Observation of model states
In order to connect the model variables to measurable quantities, an ObservationFunction can be created, with the mandatory implementation of the observe() method for each relevant calibration function. Noteworthy, a variable's state can be linked to different observation functions, reflecting the fact that there are typically several analytical methods available for one specific bioprocess quantity. This approach allows to separate the bioprocess model from corresponding observations functions and thus, increases re-usability of the different parts. By deriving initial guesses for the parameters, a simulation from the model is typically used to verify the intended qualitative behavior in comparison to the experimental data.

Global and local parameters
A key feature of pyFOOMB is the possibility to integrate measurement data from independent experimental runs (replicates) by creating a corresponding number of new instances of the same model. These can still share a common set of model parameters that are defined as "global", but at the same time differ in some other "locally" defined parameters. Typical global parameters of an ODE-based bioprocess model are the maximum specific growth rate max or the substrate specific biomass yield X/S , while all initial values are reasonable defined as local parameters (see Application example II). Different values for the local parameters reflect biological or experimental variability that may arise from slight deviations in preparing, running or analyzing each replicate experiment. Alternatively, such variability might be introduced by purpose when conducting replicate experiments with intentionally very different starting conditions. The latter refers to a classical design-of-experiment approach aiming for experimental data with a maximum information gain with respect to the global parameters.

Working with the model
The implemented model (including an initial parametrization) is passed to the instantiation of the Caretaker class ( Figure 1). During the instantiation procedure several sanity checks run in the back and, in case of failure, direct the user to erroneous or missing parts of the model. The resulting object exposes important and convenient methods typically applied for a bioprocess model, such as running simulations, setting parameter values, calculating sensitivities, estimating parameters, and managing replicates of model instances.

Simulation
For a certain set of model parameters the time-dependent dynamics of the model variables and corresponding observations are obtained by running a simulation (cf. Figure 1). Integration of the ODE system is delegated to the well-known Sundials CVode integrator with event detection [9]. Its Python interface is provided by the assimulo package [10], which implements seamless event handling hidden from the user. Running some simulations with subsequent visualization is a convenient approach to verify the qualitative and quantitative behavior of the implemented model ( Figure 2C). pyFOOMB provides a class with convenient methods for that purpose, e.g., plotting of time series data covering model simulations and measurement data, corner plots for one-by-one comparison of (non-linear) correlations between parameters from Monte-Carlo sampling as well as visualization of the results from sensitivity analysis.

Sensitivity analysis
Local sensitivities ( )∕ are available for any model response (model state or observation) with respect to any model parameter (including ICs and observation functions). The sensitivities are approximated by the central difference quotient using a perturbation value of ℎ ⋅ (1, | |). Sensitivities can also be calculated for an event parameter that defines implicitly or explicitly a point in time where the behavior of the equation system is changed (cf. Figure 3A). This is useful for, e.g., analyzing induction profiles of gene expression or irregular pulsed additions of nutrients.

Parameter estimation
Finding those parameter values for a model that describe a given measurement dataset best is implemented as a typical optimization problem. Here, the estimate_parallel() method is the first choice, because it employs performant state-of-the-art metaheuristics for global optimization, which are provided by the pygmo package [11]. In contrast to local optimization algorithms, there are no dedicated initial guesses needed for the parameters to be estimated ("unknowns"). Instead, lower and upper estimation bounds are required. As a good starting point such bounds can be derived from explorative data analysis (see Application example II), literature research, or expert knowledge by simply assuming three orders of magnitude centered around the precalculated or reported parameter value.
In principle, the pyFOOMB package allows to estimate values for any model parameter, initial value, and observation parameter. Of course, a successful parameter estimation depends on sufficiently informative measurements and on the structure of the model itself. To reduce the dimensionality of the underlying optimization task values can be fixed, e.g., based on expert knowledge or literature data. Furthermore, model reformulation or simplification can be considered to reduce complexity, and here the model family concept (see below) allows a direct comparison of different model variants.
Noteworthy, pygmo provides Python bindings to the pagmo2 package written in C++. It implements the asynchronous generalized island model [12], which allows to run several, different algorithms cooperatively on the given parameter estimation problem. As an inherent feature of this method, an optimization run can be executed for a given number of so-called "evolutions" and after inspection of the results, the optimization can be continued from the best solution found so far ( Figure 3B). This powerful approach allows to traverse multi-modal, non-convex optimization landscapes.
Currently, the maximum likelihood estimators (covering its classical variants least-squares and weighted-leastsquares) are implemented. In general, a parameter vectorî s to be found that minimizes a certain optimization (loss) function. For example, for the negative log-likelihood (NLL) function for normally distributed measurement errors it holds: Given a specific measurement̂, , , for each corresponding model response at sampling time point and replicate , the NLL is calculated and summed up. By default, it is assumed that all measurements follow normal distributions based on mean values and corresponding standard deviations. The log-likelihood function is constructed by pyFOOMB when starting the parameter estimation procedure. For the case that measurements are assumed to follow other distributions, this can be specified when creating the Measurement object and pyFOOMB will take care for the definition of the correct log-likelihood function.
Noteworthy, it is not required to provide complete measurement datasets, i.e., a specific replicate may contain only one measurement or even unequal data points for different model responses.

Uncertainty analysis
An approximation of the parameters' variance-covariance matrix is provided by inversion of the Fisher informa-tion matrix, which is calculated from local sensitivities (see above). Besides, non-linear error propagation is available by running a repeated parameter estimation procedure starting from different Monte-Carlo samples (so called "parametric bootstrapping", Figure 3C). A parallelized version of this method is provided based on the pygmo package.

Result visualization
Following parameter estimation and uncertainty analysis via parametric bootstrapping, (non-)linear correlations between each pair of parameters can be readily visualized with the method show_parameter_distributions(). In addition, results are typically inspected by visualizing the set of model predictions according to the calculated parameter distributions. Using the compare_estimates_many() method, a direct comparison between measurements and repeated simulations is possible, which makes it easier to assess the validity of the model.

Implementation of model variants
Usually, when starting to formulate a bioprocess model there is not only one option to link a specific rate term with a suitable kinetic model. Depending on how informative the available measurements are in relation to the unknown kinetics, it could make sense to directly start the whole workflow by setting up a "model family". Following the object-oriented approach of pyFOOMB, a model family can be easily set up based on inheritance ( Figure 4A). In principle, for each relevant part of the original model additional submodels can be introduced by declaring separate methods. In a programing context, this approach is also known as "method extraction", as the calculations in question are extracted into further dedicated methods. The model family is then realized by building on a common model structure encoded in the BaseModel and a set of subclasses encoding the specific submodels. On a technical level, the definition of "abstract" methods is required to enforce the individual members of the model family to implement their specific submodel.
In an extended version of the running example, the rhs() method of the BaseModel class now depends on the two additional methods get_r1() and get_r2() to separate the calculation of rates 1 and 2 , respectively (Figure 4B). The latter is declared as an abstract method to enable a family of models (ModelVariant01-03) for comparing different rate expressions of 2 .

APPLICATION EXAMPLE I: SMALL-SCALE REPETITIVE BATCH OPERATION
In the first example workflow specific growth rates within an Adaptive Laboratory Evolution (ALE) process are determined. ALE processes utilize the natural ability of microorganisms to adapt to new environments to improve certain strain characteristics, such as growth on a specific carbon source.
Here, a Corynebacterium glutamicum strain which was able to slowly ( max < 0.10 h -1 ) utilize d-xylose, was cultivated repeatedly in defined medium containing d-xylose as sole carbon and energy source. The cultivation was done in an automated and miniaturized manner, delivering a biomass-related optical signal, "backscatter", with a high temporal resolution. This signal was used to automatically start a new batch from the previous one, as soon as a backscatter threshold was reached. The threshold was deliberately chosen to be in the midexponential phase, where no substrate limitation was to be expected. Six individual clones were cultivated over one preculture and seven repetitive batches, as shown in Figure 5A.

Model development
In order to keep the number of parameters and computation times as low as possible, a rather simple bioprocess model as shown in Figure 5B was employed. Growth is determined solely by the growth rate . Substrate limitations are not taken into account, since the experimental design (see above) should avoid these sufficiently. Biomass is not measured directly, instead, backscatter is introduced to the model via an ObservationFunction. This function describes a linear relationship between backscatter and biomass and takes the blank value 0 of the signal into account. A relative measurement error for the backscatter signal of 5% is assumed based on expert knowledge. The model describes the whole ALE process for each clone, not an individual batch. Therefore, state events are used to trigger a state change of , where is multiplied by a dilution factor dil . Additionally, the maximum growth rate parameter is switched for each repetitive batch. As a result, an individual max for each repetitive batch and each clone is gained. Since initial inoculation of the different clones and the inoculation procedure within the experiment was the same for all, initial biomass concentration 0 and dilution factor dil are considered as global parameters.

Parameter estimation and uncertainty analysis
In total, model parameters for six clones are estimated, which form six replicates in the context of pyFOOMBs modeling structure. For each clone, seven maximum growth rates are to be determined, plus 0 , dil , and 0 as global parameters, thus 44 parameters in total. Parallelized MC sampling was used to obtain distributions for all parameters. Results are shown in Figure 5C and D.
The estimated backscatter signals follow the actual data closely, resulting in narrow distributions for the parameters of interest, the individual max values for each clone and repetitive batch. For example, clone F starts with growth rates of 0.071 ± 0.005 h -1 to 0.086 ± 0.005 h -1 for the first four batches. In the fifth batch, a notable raise in maximum growth rate to 0.122 ± 0.008 h -1 is visible, indicating one or more beneficial mutation events. Finally, clone F reaches a growth rate of 0.212 ± 0.013 h -1 . Overall, the estimated growth rates are in good agreement with findings from the original paper.
In another style of ALE experiment, which is not subject in this study, a subpopulation of cells with beneficial mutations was enriched, yielding strain WMB2 , which is analyzed in the second application example.

APPLICATION EXAMPLE II: LAB-SCALE PARALLEL BATCH OPERATION
In this example workflow some KPIs of an engineered microbial strain cultivated in a bioreactor under batch operation are determined. Often, such KPIs represent process quantities that are not directly measurable (e.g., specific rates for substrate uptake, biomass and product formation) and therefore have to be estimated using a model-based approach.
The data originates from two independent cultivation experiments with the evolved C. glutamicum strain WMB2 as introduced before [13]. Following successful adaptive laboratory evolution this strain has now improved properties for utilizing d-xylose as sole carbon and energy source for biomass growth. At the same time the strain produces significant amounts of d-xylonate, a direct oxidation product of d-xylose.

Explorative data analysis and model development
Before implementing a suitable bioprocess model with pyFOOMB, the data from one replicate bioreactor cultivation is visualized and used for explorative data analysis. In Figure 6A, the time courses of biomass ( ), d-xylose ( ), and d-xylonate ( ) are presented in one subplot. It can be seen that biomass formation stops with depletion of dxylose and, thus, modeling the cell population growth by a classical Monod kinetic is reasonable ( Figure 6B). The formation of d-xylonate is also strictly growth-coupled, leading to a simple rate equation with the yield coefficient P/X as proportionality factor. Finally, the d-xylose uptake rate equals the combined carbon fluxes into biomass and dxylonate, which are related to the yield coefficients X/S and P/S respectively.
The time courses of substrate and product are measured in molar concentrations, while the bioprocess model is formulated using mass concentrations of the respective species. The mappings are realized by defining corresponding observation functions ( Figure 6C).
Finally, the strain-specific parameters like max and X/S are defined as global parameters, while experiment-specific parameters (ICs for biomass and substrate ) are defined as local parameters since the cultivation media and inoculation material were prepared individually for each reactor. Please note, even this very simple process model now already contains eight model parameters (i.e., three ICs and five kinetic parameters) that have to be estimated from the given measurements.

Parameter estimation and uncertainty analysis
In order to facilitate the parameter estimation problem, good initial guesses for all parameter values are important. First approximations for max as well as all yield coefficients can be derived by following ordinary and orthogonal distance regression analysis on the raw data assuming linear relationships ( Figure 6A). For Python, corresponding methods are available from the NumPy [14] and SciPy [15] packages.
From the obtained initial guesses corresponding parameter bounds are fixed to run a parallel parameter estimation procedure ( Figure 7A). As a result, a first set of best-fitting parameter values is obtained from which new bounds can be derived for the subsequent uncertainty analysis using again parallelized MC sampling. Corresponding results are summarized in Table 2. The pair-wise comparison of parameter distributions shown in Figure 7B reveals a distinct non-linear correlation between the yield coefficients P/S and X/S . This effect is expected due to the formulation of the biomassspecific substrate consumption rate ( Figure 6B). Equal values for can be derived for different combination of substrate conversion rates into biomass and product, and the yield coefficients are the corresponding scaling factors. The latter is also the reason why the estimated yield coefficients are significantly higher as compared to the explorative data analysis, which does not allow this separation and therefore leads to false-to-low predictions ( Table 2 and Figure 6A).
Finally, the estimated biomass yield X/S for d-xylose is close to the value reported for the wild-type strain growing on d-glucose, i.e., 0.63 [CI: 0.58-0.69] vs. 0.60 ± 0.04 g g -1 [16]. This indicates a comparable efficiency of C. glutamicum WMB2 in utilizing d-xylose for biomass growth.

CONCLUSIONS
The pyFOOMB package provides straight-forward access to the formulation of bioprocess models in a programatic and object-oriented manner. Based on the powerful, yet beginner-friendly Python programing language, the package addresses a wide range of users to implement models with growing complexity. For example, by employing event methods, pyFOOMB supports the modeling of discrete behaviors in process quantities, which is an important feature for the simulation and optimization of fedbatch processes. The concept of model replicates and definition of local and global parameters mirrors the iterative nature of data generation from cycles of experiment design, execution and evaluation. Moreover, seamless integration with existing and future Python packages for scientific computing is greatly facilitated. In summary, pyFOOMB is an ideal tool for model-based integration and analysis of data from classical lab-scale experiments to state-of-the-art high-throughput bioprocess screening approaches.

AVAILABILITY
The source code for the pyFOOMB package is freely available at github.com/MicroPhen/pyFOOMB. It is published under the MIT license. Currently, its compatibility is tested with Python 3.7 and 3.8, for Ubuntu and Windows operating systems. The use of pyFOOMB within a conda environment is recommended, since the most recent versions of important dependencies are maintained at the condaforge channel.

A C K N O W L E D G E M E N T S
This work was partly funded by the German Federal Ministry of Education and Research (BMBF, projects: "Digitalization in Industrial Biotechnology", grant no. 031B0463 and "BioökonomieREVIER_INNO: Entwicklung der Modellregion BioökonomieREVIER Rheinland", grant no. 031B0918A). Further funding was received from the Bioeconomy Science Center (BioSC, Focus FUND project "HyImPAct -Hybrid processes for important precursor and active pharmaceutical ingredients", grant no. 313/323-400-00213).

C O N F L I C T O F I N T E R E S T
The authors have no conflict of interest to declare. Implements several example bioprocess models, serving as starting point for implementation of user-specific ones.

O R C I D
7

Fed-batch models
Demonstrates the implementation of fed-batch bioprocess models at various complexities. Shows how to use the models to derive further performance indicators such as maximum yield and productivity and how to get these from a model parameters' search. In addition, the formulation of the corresponding optimization problem is presented.

Measurement data
Loading measurement data from spreadsheet files with subsequent creation of "Measurement" objects, focusing on three use cases that are based on: 1) Individual time vectors of varying lengths, with a shared time vector; 2) A shared time vector but missing data points for several measurements, and 3) Multi-replicate experiments with a shared time vector but missing data points.

9
Parallel parameter estimation (PPE) Introduces PPE and the concept of continuation of an estimation job.

PPE -Optimizer comparison
Compares different optimization algorithms for PPE of a simple bioprocess model utilizing artifical noisy data. Comparison is based on runtimes and achieved losses for the given optimization problem.

PPE -Hyperparameter adjustment
Demonstrates how different parameter settings of the "de1220" and "compass_search" algorithms affect runtime and quality of the model calibration outcome.

PPE -Monte Carlo sampling
Introduces the application of PPE for Monte-Carlo sampling as method for non-linear error propagation.

PPE -Monte Carlo Sampling
In addition to the previous examples, the possibility to apply further constraints (beyond simple box bounds for parameters) is demonstrated.
14 Tracking specific rates during integration Shows how specific rates can be derived and visualized as time-dependent performance indicators.

15
Non-negative states For enforcing non-negative state values, events can be employed. Without, state values can take very small but negative numbers due to the operation mode of the integrator, which treats those values internally as zero (depending on the specified tolerances).