Combining model structure identification and hybrid modelling for photo‐production process predictive simulation and optimisation

Integrating physical knowledge and machine learning is a critical aspect of developing industrially focused digital twins for monitoring, optimisation, and design of microalgal and cyanobacterial photo‐production processes. However, identifying the correct model structure to quantify the complex biological mechanism poses a severe challenge for the construction of kinetic models, while the lack of data due to the time‐consuming experiments greatly impedes applications of most data‐driven models. This study proposes the use of an innovative hybrid modelling approach that consists of a simple kinetic model to govern the overall process dynamic trajectory and a data‐driven model to estimate mismatch between the kinetic equations and the real process. An advanced automatic model structure identification strategy is adopted to simultaneously identify the most physically probable kinetic model structure and minimum number of data‐driven model parameters that can accurately represent multiple data sets over a broad spectrum of process operating conditions. Through this hybrid modelling and automatic structure identification framework, a highly accurate mathematical model was constructed to simulate and optimise an algal lutein production process. Performance of this hybrid model for long‐term predictive modelling, optimisation, and online self‐calibration is demonstrated and thoroughly discussed, indicating its significant potential for future industrial application.

developed by embedding new physical understandings (e.g. adding the Aiba model for light intensity effects, and the Beer-Lambert's law for light attenuation) into classic models such as the Monod and the Droop model (Fouchard, Pruvost, Degrenne, Titica, & Legrand, 2009;Packer et al., 2011;Ryu et al., 2018). However, identifying a correct model structure to describe the physical knowledge is a challenging task, usually with long development times for parameter estimation and model structure selection. This often results in a complex model structure leading to issues with parameter estimation and further use in process optimisation (do Carmo Nicoletti & Jain, 2009). Moreover, as there is still critical biological knowledge yet undetermined, pure kinetic models cannot provide satisfactory accuracy for process long-term prediction, sacrificing their industrial applicability.
On the other hand, machine learning-based data-driven models such as artificial neural networks (ANNs) and Gaussian processes (GPs) have been applied in parallel to avoid the above problems for the predictive modelling and optimal control of photo-production processes (Bradford, Schweidtmann, Zhang, Jing, & del Rio-Chanona, 2018;Liyanaarachchi, Nishshanka, Nimarshana, Ariyadasa, & Attalage, 2020). Although these data-driven models can better capture the complex process dynamics in a specific operating range without the need of prior physical knowledge, they cannot be extrapolated to a wider operating range and are prone to overfitting (Bernard, Dochain, Genovesi, Gouze, & Guay, 2008;do Carmo Nicoletti & Jain, 2009). Most importantly, having a large size (e.g., hundreds to thousands) of data is a prerequisite for data-driven model construction. However, as photo-production processes are slow dynamic systems (e.g., general algae and cyanobacteria doubling time~20 and 7-12 hr, respectively, compared to bacteria (~0.5 hr) and yeast (~1.5 hr; Bernstein et al., 2016;Liu, Huang, & Che, 2011), it usually takes several weeks or months to complete one experiment for data collection. Hence, acquiring substantial data for photoproduction systems modelling is a severe practical obstacle.
In addition, after decades of research, substantial physical information has been discovered for photo-production processes. Thus, pure data-driven models, which are mainly used for systems with little mechanistic information, are not suited for photo-production systems. To this end, hybrid modelling, which aims to integrate prior domain knowledge and machine learning techniques, provides a potentially better simulation strategy. The concept of hybrid modelling was proposed in 1992 (Psichogios & Ungar, 1992) for fermentation process control, and the methodology was developed since the 2000s (Feyo de Azevedo, Dahm, & Oliveira, 1997;Oliveira, 2004).
The structure of a hybrid model is flexible depending on the approach to combine data-driven models and mechanistic models (von Stosch, Oliveira, Peres, & Feyo de Azevedo, 2014). In recent years, there have been several studies attempting to apply this strategy into biochemical and general chemical engineering applications (Bangi & Kwon, 2020;Carinhas et al., 2011;Portela, von Stosch, & Oliveira, 2018).
However, all of the previous studies exclusively used ANNs as the data-driven part for hybrid model construction, resulting in two issues. First, embedding ANNs into a kinetic model can impede the numerical calculation accuracy during dynamic parameter estimation.
Although it is still feasible to estimate parameters in the kinetic part (Graefe et al., 1999), the optimisation solution quality can be compromised because of the more nonlinear and complex mathematical structure of the hybrid model (Hamilton, Lloyd, & Flores, 2017;von Stosch, 2011). This can be more problematic if several ANNs are embedded into a kinetic model to substitute its original physical parameters.
Secondly, distinct from the kinetic models used in chemical engineering, most kinetic models used in bioprocesses, in particular photo-production processes, are slightly over-fitted (partially nonidentifiable, meaning that some parameters have a large confidence interval or high uncertainty) due to the lack of experimental data which is time consuming to obtain (Kiparissides, Koutinas, Kontoravdi, Mantalaris, & Pistikopoulos, 2011;Lee, Ding, Jayaraman, & Kwon, 2018). Fully identifiable kinetic models which remove parameters that are numerically nonidentifiable but have unique physical meanings are often found to have worse fitting results and unsatisfactory predictions for practical use given their oversimplified model structure (Bernard et al., 2008). Hence, using mildly overfitted kinetic models for bioprocess simulation is inevitable to balance the trade-off between physical insight and mathematical accuracy. Nonetheless, ANNs usually consist of tens or hundreds of parameters, and little effort has been made to systematically identify the suitable ANN structure inside a hybrid model. Thus, integrating an ANN into a kinetic model will aggravate the overfitting risk, jeopardising the hybrid model's predictive capability.
As a result, this study aims to develop a more flexible hybrid modelling strategy to combine prior physical knowledge with datadriven techniques. In addition, automatic model structure identification will be applied into this strategy to effectively determine the best model structure with the minimum number of data-driven parameters. As hybrid modelling has never been applied to photoproduction processes, this study will use microalgal lutein production as a case study to demonstrate the practical use and performance of this hybrid modelling approach.

| Computational experiment setup
To illustrate the proposed model construction framework and test the efficiency of the hybrid model in process prediction and optimisation, computational experiments were used in this study. Algal biomass growth and lutein synthesis are mainly affected by light intensity and nitrate concentration (Xie et al., 2013). A complex kinetic model proposed previously was used to generate computational experimental data and is presented in Equation (1a)-(1f) (del Rio-Chanona, Ahmed, Zhang, Lu, & Jing, 2017). This model can adequately simulate effects of light intensity, light attenuation, and nitrate supply on biomass growth and lutein production. However, given its complex model structure, its application in process ZHANG ET AL.
| 3357 optimisation and bioreactor design is limited, and identifying its model structure is time consuming. The reasons to use computational experiments are that this enables the estimation of "best" process performance (i.e. optimal process calculated by the complex process model) which can be adopted as a benchmark to evaluate the hybrid model's efficiency, as well as being more time efficient to generate data compared to physical experiments.
where c X , c N , and c L are the concentrations of biomass, nitrate, and lutein, respectively. F in and , c N in are nitrate inflow rate and concentration, respectively. u m is maximum cell-specific growth rate, respectively. k s and k sL are light saturation terms for cell growth and lutein synthesis, respectively. k i and k iL are light inhibition terms for cell growth and lutein synthesis, respectively. τ is cell absorption coefficient, K a is the bubble scattering coefficient. l is the distance from light source, z is the width of the reactor (0.084 m). I n is the local light intensity at a distance of ( = … ) • n 1, , 10 n z 10 m away from the incident light surface area, while I 0 is the incident light intensity.
Parameter values can be found in del Rio-Chanona et al. (2017).
This complex model acts as the "true" experimental system.
Initially, three computational experiments (Experiments 1-3) were conducted under a broad spectrum of operating conditions from nitrate-limiting and photo-limitation to nitrate-excessive and photoinhibition as listed in Table 1. Samples were collected once per 12 hr and data generated in these experiments was used to construct the hybrid model. In total, 3 data sets were produced with 11 data points in each data set. This is to mimic real experimental implementations and ensure that the computational experiments are as realistic as possible (Xie et al., 2013). Once constructed, the hybrid model was first exploited to predict and optimise several fed-batch processes (offline in Table 1) and was subsequently verified via computational experiments and compared against the best process performance.
Finally, the hybrid model's self-calibration efficiency was also evaluated through an online experiment (online in Table 1).

| Hybrid model construction
As stated previously, the complexity and nonlinearity of a bioprocess consists both of identified physical mechanisms and undetermined process dynamics. Therefore, as shown in Equation (2), the principle of a hybrid model is to quantify the underlying bioprocess behaviour by using a kinetic model ( ) S K to tackle the known dynamics (physical knowledge) and a data-driven model ( ) S D to account for the unknown dynamics. Depending on the amount of known physical knowledge and process data, the structure of ( ) represent the concentrations of biomass, nutrient, and product re- Initial lutein production (mg/L) 0.0 0.0 0.0 0.0 0.0 Operating time (hr) 120 120 120 120 120 Note: Experiments 1-3 (batch processes), used for parameter estimation; offline (fed-batch processes): initial conditions of the four offline optimisation processes; online (fed-batch process): initial conditions of the model self-calibration experiment. Moreover, a 2nd degree PR is also known as an extension of the Lotka-Volterra model (Guerra, 2014), a classic model used to understand the interactions of communities in microbiome studies.
More importantly, parameter estimation and uncertainty analysis of a PR integrated hybrid model is more straightforward to execute compared to an ANN embedded complex hybrid model, as extracting gradient information is not trivial in ANNs, as is their parameter estimation and optimisation. This feature is particularly advantageous for industrial applications, as estimating model uncertainty is crucial in bioprocess operation and decision-making.
When considering the kinetic part K(S), its aim is to approximate the process trajectory. As a result, only light intensity, light attenuation, and nitrate supply are considered when building K(S).
Other factors such as photo-inhibition, biomass decay, and lutein where the subscript K refers to the kinetic part of the hybrid model.

| Automatic model structure identification
As discussed before, the two challenges for hybrid model construction are identifying the physically correct model structure of the kinetic part and preventing the overfitting of the data-driven part. To address these two challenges simultaneously, an automatic model structure detection strategy which comprises of model reformulation and sparse identification of nonlinear dynamics (SINDy, a mathematical algorithm that aims to identify the minimum number of mathematical terms which can accurately regress a given data set) is proposed in this study. The original idea of this strategy was developed to reveal natural laws based on experimental data using basic mathematical operators without any prior knowledge (Schmidt & Lipson, 2009). This idea was then applied to nonlinear dynamic systems in 2016 to identify the governing equations for fluid dynamics (Brunton, Proctor, & Kutz, 2016). The current study aims to modify this strategy and adopt it into photo-production bioprocess hybrid modelling. More mathematical theory background of model structure detection and sparse optimisation can be found in (Bhadriraju, Narasingam, & Kwon, 2019). Detailed implementation of this strategy is explained below.
Initially, the original hybrid model is reformulated into a mixedinteger nonlinear programming (MINLP) problem by assigning binary variables into each individual term in the hybrid model, as presented in Equation (4a)-(4c). By adding two equality constraints equation (4d)-(4e), different kinetic equations for nitrate uptake and lutein production are merged into one expression shown as Equation (4b) and (4c), respectively. As binary variables can only be either 0 or 1, imposing the equality constraints will ensure that only one of the ZHANG ET AL.
| 3359 possible kinetic formulations (e.g. r r and C C N1 N2 ) is active, with the other being inactive as its binary variable is 0. Through this way, kinetic model discrimination can be executed more efficiently than the trial and error approach. This method can also be applied to general cases in which multiple contradictory kinetic hypotheses are present. Data nominalisation must be executed before using the polynomial regression model. However, this is not needed for kinetic model construction, as each parameter in the model has its unique physical meaning and unit.  respectively. r C N1 and r C N2 are the first term on the right-hand side of Equations (3b) and (3c), respectively. r r and C C 1 L L2 are the first term on the right-hand side of Equation (3d) and (3e), respectively.
Once reformulated, the next step is to implement sparse optimisation to automatically identify the most probable kinetic model structure and minimum number of polynomial terms that can well fit the three experimental data sets. Thus, the objective function for parameter estimation, Equation (5), is designed such that in addition to minimising residues between model prediction and experimental data, it also penalises the total number of active binary variables in the polynomial terms to avoid overfitting. In this way, the optimal hybrid model structure alongside its parameter values can be where c i p k E , , , and c i p k M , , , are experimental and model-simulated value for concentration of state S i at time step ∈ p m in the kth data set (total number of data sets is n), respectively, w i j k , , and w b are the weight for each data point and the sum of binary variables, respectively.
In this study, parameter estimation was formulated as a weighted nonlinear least-squares optimisation problem. However, as the current model is highly nonlinear due to its complex model structure and many discrete (binary) variables, it must be relaxed so that it can be solved using the established dynamic parameter estimation method. Conventionally, an MINLP problem is relaxed via introducing new parameters and extra linear constraints to reduce nonlinearity of the original problem (del Rio-Chanona, Zhang, & Shah, 2018). For example, converting Equations (4a)-(6a) where new continuous parameters B j 1 are introduced to substitute the bilinear term ⋅ b a j j 1 1 and satisfy the constraints equation (6b) and (6c): where a j The NLP model was discretised by orthogonal collocation over finite elements in time, and was then solved using the interior point nonlinear optimisation solver IPOPT through a multi-start framework. This was executed in the Python optimisation environment Pyomo (Hart, Laird, Watson, & Woodruff, 2012). Confidence intervals were also calculated during parameter estimation, and dynamic optimisation was performed in Python following the same procedure.
Model uncertainty analysis was carried out using the standard package in the commercial software Wolfram Mathematica 12.0.

| Result of hybrid model construction
Through automatic model structure identification, the hybrid model is identified as Equation (8a)-(8c) (all inactive terms are removed; active binary variables are not shown as they are equal to 1). Table 2 3360 | shows the parameter estimation result. Figure 1 shows the modelling fitting result. The 95% confidence intervals of kinetic parameters are listed in practically non-identifiable given its high uncertainty. This further indicates that due to the sophisticated process mechanism and practical challenge in generating substantial data for photo-production processes, it is advantageous to build hybrid models that consist of a simple kinetic part and a simple data-driven part. Using complex datadriven models or elaborate kinetic models to simulate a photoproduction process will inevitably exacerbate overfitting.
From Figure 1, it is seen that the hybrid model can accurately fit all the three data sets collected over a broad spectrum of respectively. In addition, the total number of binary variables assigned to the data-driven model is 15, while only 5 are active after sparse optimisation. The kinetic formulation is also identified more efficiently than traditional model discrimination.
Compared to the complex kinetic model shown in Equation (1a)-(1d), the current hybrid model adequately fits all the data sets with a simpler model structure. As the parameter k sL accounting for lutein production is practically non-identifiable, it is found that model uncertainty is slightly higher when predicting lutein production. Overall, it is concluded that integrating automatic model structure identification into hybrid model construction is advantageous.

| Model validation and offline optimisation
To further validate accuracy of this hybrid model, it is first used to predict optimal control actions for a long-term fed-batch lutein production process, which is subsequently verified experimentally.
The objective function is to maximise the final lutein production over a 5-day operation system in which incident light intensity (100-800 μmol·m −2 ·s −1 ) and nitrate inflow rate (0-17.5 ml/hr with a concentration of 0.1 M so that total added liquid volume is negligible) can change once per day. Once optimal control actions were predicted by the hybrid model, they were implemented and verified by the computational experiment. In addition, the original complex kinetic model was also used to search for the theoretical maximum lutein production in this fed-batch process. This is used as a benchmark to test the hybrid model's efficiency. Initial operating conditions of this process are listed in Table 1. prediction is verified to be accurate. From Figure 2b,d,f, it is also observed that the hybrid model can well predict the theoretical best process behaviour. The prediction error of the hybrid model is 3.4%, 1.9%, and 3.1% for concentrations of biomass, nitrate and lutein, respectively. The maximum model prediction uncertainty is 8.7% (final lutein production). When comparing the two processes, the hybrid model is found to result in the same lutein production as the theoretical best process, indicating its high performance in offline optimisation. The two processes (identified by the hybrid model: To restrict the solution space and include more practical concerns, a new offline optimisation problem is formulated by embedding three constraints: the first being that changes between two adjacent nitrate inflow rates must be less than 2 ml/hr; the second being that final culture nitrate concentration must be lower than 600 mg/L; and the third being that total lutein production must be no lower than 4.0 mg/L. The first constraint aims to prevent drastic changes of culture environment, the second aims to reduce raw material cost, and the third aims to guarantee a similar final production to the theoretical maximum value. The initial operating conditions are set to be the same as before. Through experimental verification as shown in Figure 4, it is concluded that, once again, the hybrid model well predicted the dynamics of its optimised process. It also predicted well concentrations of biomass (prediction error 3.4%) and lutein (prediction error 3.1%) in the theoretical best process (with a maximum prediction uncertainty of 8.7% on final lutein production). Nonetheless, its prediction of nitrate concentration in the theoretical best process ( Figure 4d) is found to have large deviations (prediction error 7.1%).
In addition, its uncertainty in nitrate concentration (Figure 4c,d) increases significantly ( | 3363 update its large number of parameters. Nonetheless, offline data collected from an ongoing process is usually limited, hence not able to satisfy this prerequisite. Another great advantage of hybrid modelling is that it can solve both issues relatively easily. The kinetic part of a hybrid model does not have to be frequently calibrated as its mismatch has been included in the data-driven part. The data-driven part of a hybrid model has a simple structure with a minimal amount of parameters (e.g., five parameters in this case), thus offline data (usually sampled once per 4 hr in an industrial process, hence 6 data points/day) collected from a process can be used to update the model's accuracy on a daily basis.
Parameter re-estimation of the current hybrid model takes seconds to complete, having minimum impact on delaying process decisionmaking or estate estimation. As a result, online calibrating a hybrid model is feasible and straightforward.
To illustrate this advantage, another case study was conducted. Since the current hybrid model is accurate for long-term offline optimisation, it was decided to first add errors in its parameter values such that a large initial model-plant mismatch is observed. Therefore, a 15% random error was assigned to all the parameters, and the new model's prediction is shown in Figure 5 when only the initial operating conditions and pre-determined control actions are given. It is seen that six large mismatch exists at a later stage of the process. However, after model recalibration at the end of Day 1 (using 6 new data points collected in Day 1), it is found that the model's prediction is improved significantly, particularly for biomass (average prediction error reduced from 11.7% to 1.5%) and lutein concentrations (average error reduced from 14.3% to 2.1%). Daily calibration of the data-driven part of the model leads to moderate improvement until Day 3, beyond which this effect becomes negligible. Nonetheless, it is also observed that the hybrid model still contains an error for online state estimation and long-term prediction, particularly for nitrate concentration (average prediction error~2.7%). This is because the (e, f) Optimal nitrate inflow rate identified by the hybrid model and the theoretical best process, respectively. As the optimal control actions of light intensity are found to be the same as Figure 3c,d, they are not repeated here kinetic part of the model comprises a 15% error in its parameters and has never been calibrated. In practice, if enough new data points can be sufficiently accumulated from an ongoing process, they can be used to re-estimate values of all the parameters in the hybrid model to further improve model accuracy.

| CONCLUSION
Overall, based on the current case studies, it is concluded that by adopting the automatic model structure detection strategy, it is possible to construct a flexible hybrid model that combines both prior physical knowledge and insights from the data obtained (through machine learning techniques). Compared to a pure kinetic or a pure data-driven model, the hybrid model does not require deep physical understanding of the underlying system or a large amount of process data. Based on (computational) experimental verification, it is found that the hybrid model shows significant potential to be used for process optimisation and monitoring, and its self-calibration is easy to implement in an online operating system. Future research will be conducted to further investigate its advantages in process scaleup and reactor design.
It is important to emphasise that although this study conducted model structure identification and parameter estimation simultaneously, in practice this can be executed in sequence depending on the complexity of the kinetic part of the hybrid model. With more available physical information, it is possible to develop more accurate kinetic models that better quantify the process behaviour. The more accurate the kinetic part is, the less complex the data-driven part will be. However, solving a highly complex dynamic MINLP problem is mathematically challenging.
Thus, one can first identify the kinetic model parameters if the kinetic part is already complex (e.g., more available physical knowledge), and then use the SINDy algorithm to determine the data-driven part formulation and parameters. In addition, a hybrid model should be mainly considered when there is prior domain knowledge of the underlying process. The complexity of the kinetic part in a hybrid model is case by case, and relies on the modeller's experience. Based on the current work, we would recommend that one should only use the kinetic part to simulate primary process mechanisms, and use the data-driven part to account for secondary process mechanisms. This may help with constructing an accurate yet simple hybrid model structure.