Transforming data to information: A parallel hybrid model for real‐time state estimation in lignocellulosic ethanol fermentation

Abstract Operating lignocellulosic fermentation processes to produce fuels and chemicals is challenging due to the inherent complexity and variability of the fermentation media. Real‐time monitoring is necessary to compensate for these challenges, but traditional process monitoring methods fail to deliver actionable information that can be used to implement advanced control strategies. In this study, a hybrid‐modeling approach is presented to monitor cellulose‐to‐ethanol (EtOH) fermentations in real time. The hybrid approach uses a continuous‐discrete extended Kalman filter to reconcile the predictions of a data‐driven model and a kinetic model and to estimate the concentration of glucose (Glu), xylose (Xyl), and EtOH. The data‐driven model is based on partial least squares (PLS) regression and predicts in real time the concentration of Glu, Xyl, and EtOH from spectra collected with attenuated total reflectance mid‐infrared spectroscopy. The estimations made by the hybrid approach, the data‐driven models, and the internal model were compared in two validation experiments, showing that the hybrid model significantly outperformed the PLS and improved the predictions of the internal model. Furthermore, the hybrid model delivered consistent estimates even when disturbances in the measurements occurred, demonstrating the robustness of the method. The consistency of the proposed hybrid model opens the door to the implementation of advanced feedback control schemes.


| INTRODUCTION
The transition of fuels and chemicals production from nonrenewable to renewable resources is a key requirement in realizing a circular economy. An example of this transition is the production of ethanol (EtOH; as fuel and chemical platform) from renewable substrates such as lignocellulosic feedstocks, which otherwise would be discarded as waste material (Drapcho et al., 2008; Li et al., 2016). Despite decades of research on lignocellulosic fermentation, real-time monitoring of key state variables in such fermentation processes is still required to counter the effects of: (i) toxicity of the fermentation media derived from lignocellulosic feedstocks, (ii) the presence of a mixed carbon source in the substrate, (iii) the inherent variation of the feedstocks, and (iv) contamination that can occur in industrial settings (Cabaneros et al., 2019; Drapcho et al., 2008).
Fed-batch process operations, where the feed rate is adjusted to keep the concentration of inhibitors and glucose (Glu) inside the reactor below a certain level, can be useful to mitigate the toxic effects of the inhibitors and to promote the co-consumption of the different carbon sources (Drapcho et al., 2008; Knudsen & Rønnow, 2020; Mauricio-Iglesias et al., 2015). However, the substrate variability often results in different fermentation profiles between batches, which can result in significant operational challenges. In this context, limiting the feed rate to avoid the effect of the inhibitors can decrease the productivity of the process, and increasing the length of the fed-batch process can increase the risk of contamination (Cabaneros et al., 2019; Eliasson Lantz et al., 2010). To optimize the cellulosic EtOH fermentation, it is necessary to develop flexible operations that are able to account for the effects of substrate variability and to react to possible process deviations in real time (as the fermentation progresses; Eliasson Lantz et al., 2010). As a consequence, developing and implementing appropriate monitoring schemes to gain real-time information on the state of the fermentation is crucial to enable control actions and to improve the competitiveness of cellulose-to-EtOH processes (Cabaneros et al., 2019; Eliasson Lantz et al., 2010).
With the ever-increasing intention of biochemical industries to leverage data to improve process operations, there is increased interest in applying measurement methods to monitor processes in real time (Udugama et al., 2020). Choosing a suitable monitoring method for cellulose-to-EtOH fermentations is challenging due to the high complexity of the fermentation matrix and the high concentration of suspended solids (Cabaneros et al., 2019; Eliasson Lantz et al., 2010). In practice, the commonly monitored variables in fermentation processes (for example, pH, temperature, or pO2) often fail to deliver actionable information for designing feedback control schemes (Cabaneros et al., 2019). Therefore, more advanced measurements (e.g., of substrates, products, biomass, or inhibitors) are needed to improve the operation of cellulose-to-EtOH processes. Different measuring methods are available to monitor the compounds dissolved in the liquid phase (e.g., vibrational spectroscopy, biosensors, or at-line high-performance liquid chromatography [HPLC]) or to monitor the biomass concentration (e.g., capacitance probes or fluorescence spectroscopy; Cabaneros et al., 2019). Among the different options available, attenuated total reflectance mid-infrared spectroscopy (ATR-MIR) is an analytical tool that allows the fast and simultaneous detection of several compounds (including multiple sugars or weak acids) from the fermentation media (Cabaneros et al., 2019; Lourenço et al., 2012). Unlike other spectroscopic methods, ATR-MIR spectroscopy measures the light reflected from the sample (instead of the light transmitted through it), making it more robust and suited to monitor systems with a high concentration of suspended solids (Cabaneros et al., 2019; Lourenço et al., 2012).
The collected spectra are then analyzed using data-driven methods, usually partial least squares (PLS) regression, to exploit the linear correlations between the concentrations of the different analytes and the absorbance in the spectra (Beer-Lambert law; Lourenço et al., 2012). However, the complexity of the media and the highly correlated dynamics between the concentrations of many analytes result in complex spectra with overlapping peaks that require extensive data analysis to train reliable predictive models (Cervera et al., 2009; Krämer & King, 2016). This situation makes the measurements noisy and often unsuited for the implementation of advanced control schemes (Krämer & King, 2017).
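As an illustration of how a PLS model exploits the linear absorbance-concentration relation, the following minimal Python sketch fits a single-response PLS model (NIPALS algorithm) to simulated spectra. The pure-component bands, the noise level, the number of samples, and the function names (`make_spectra`, `pls1_fit`, `pls1_predict`) are all assumptions made for this example and do not correspond to the spectra or models of this study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pure-component spectra: three overlapping Gaussian bands
# (one per analyte), on an arbitrary channel axis.
channels = np.arange(200)
centers = [60.0, 100.0, 140.0]
pure = np.array([0.01 * np.exp(-0.5 * ((channels - c) / 25.0) ** 2)
                 for c in centers])

def make_spectra(conc):
    """Linear (Beer-Lambert-like) mixing of pure spectra plus white noise."""
    noise = rng.normal(0.0, 0.001, size=(conc.shape[0], channels.size))
    return conc @ pure + noise

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model with the NIPALS algorithm."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)          # weight vector
        t = Xk @ w                      # scores
        tt = t @ t
        p = Xk.T @ t / tt               # X loadings
        c = yk @ t / tt                 # y loading
        Xk = Xk - np.outer(t, p)        # deflate X
        yk = yk - c * t                 # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # regression vector
    return coef, x_mean, y_mean

def pls1_predict(X, model):
    coef, x_mean, y_mean = model
    return (X - x_mean) @ coef + y_mean

# Simulated calibration and test sets (concentrations in g/L, assumed range).
conc_train = rng.uniform(0.0, 50.0, size=(30, 3))
conc_test = rng.uniform(0.0, 50.0, size=(20, 3))
model = pls1_fit(make_spectra(conc_train), conc_train[:, 0], n_components=3)
pred = pls1_predict(make_spectra(conc_test), model)
rmse = np.sqrt(np.mean((pred - conc_test[:, 0]) ** 2))
print(f"RMSE for the first analyte: {rmse:.3f} g/L")
```

In this idealized setting the bands overlap but the matrix does not change, so the model predicts well; the matrix effects discussed in the text are exactly what such a simulation leaves out.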
The continuous-discrete extended Kalman filter (CD-EKF) is an appropriate tool to monitor bioprocesses given the nonlinear kinetics of biological systems (Mauricio-Iglesias et al., 2015; Price et al., 2014; Ricardo, 2019). Similar to the Kalman filter (KF), the CD-EKF algorithm operates iteratively in two steps: a prediction step and an update step.
In this study, a hybrid monitoring approach based on CD-EKF is proposed to estimate the concentration of Glu, xylose (Xyl), and EtOH from spectroscopic measurements collected with ATR-MIR spectroscopy and to monitor the progression of cellulose-to-EtOH fermentations in real time (von Stosch et al., 2014). The CD-EKF reconciles the predictions made by the internal model (a kinetic model for cellulosic EtOH fermentation) with the PLS predictions of the concentrations of Glu, Xyl, and EtOH. Due to the high complexity and limited availability of fermentation media, the calibration set for the PLS models contained only synthetic samples that were planned using a design of experiments approach, and no fermentation samples were included in it. This calibration set was designed to minimize the correlation between the concentrations of Glu, Xyl, and EtOH and to distribute the leverage of each sample evenly in the design space. The hybrid approach provides a more stable and robust monitoring framework because it compensates for the deficiencies of a purely mechanistic or purely data-driven approach. That is, unmeasured process disturbances and inherent variations in biological systems can lead to significant mismatches with the kinetic model, and data-driven sensors can be noisy, difficult to interpret, and often lack extrapolation power because they ignore the dynamics of the system.
The developed approach was then applied to monitor different cellulose-to-EtOH fermentations carried out at the bench scale, and the results obtained were compared to a scenario where only measurements are used to monitor the process.
30°C and 180 rpm. One milliliter of the grown cell culture was transferred to a 250 ml shake flask containing 100 ml of YPX medium (10 g/L of yeast extract, 20 g/L of peptone, and 20 g/L of Xyl [Sigma-Aldrich]), and it was grown for 36 h at 30°C and 180 rpm. One milliliter of the cell culture grown in YPX was diluted 1000 times, plated on a YPX-agar plate (YPX medium with 10 g/L agar), and incubated for 36 h at 37°C before storage at 4°C.

| Cell culture propagation
A single colony of S. cerevisiae CEN.PK.XXX grown on YPX-agar plates was transferred to a 250 ml shake flask containing 100 ml of YPX medium, and it was grown for 36 h at 30°C and 180 rpm. One milliliter of the grown cell culture was inoculated in a 500 ml shake flask filled with 250 ml of YPX medium and grown for 36 h at 30°C and 180 rpm before inoculation of the fermenter. Before inoculating, the dry weight of the cell culture was measured as described by El-Mansi et al. (2012).

| Fermentation experiments
Four batch fermentations were carried out in a 2.5 L BIOSTAT® A bioreactor (Sartorius) with a working volume of 1.5 L (Table 1), equipped with two six-bladed Rushton impellers and with pH and temperature control. In all fermentations, the pH was controlled at 6 using 5 M H2SO4 and 2 M NaOH. The temperature was kept at 30°C and the stirring rate at 450 rpm. The wheat straw hydrolysate was supplemented with 5 g/L of yeast extract and 10 g/L of peptone and centrifuged (Heraeus Multifuge X3R; Thermo Fisher Scientific) for 10 min at 4000 rpm to reduce the concentration of suspended solid compounds to 10 g/L. In fermentations 2, 3, and 4, the media was centrifuged for another 10 min at 4000 rpm to further reduce the concentration of suspended solid compounds to 2 g/L. A volume of 1.4 L of wheat straw hydrolysate (prepared as explained in the Supporting Information Material) was inoculated with 100 ml of grown cell culture. The fermentations lasted between 25 and 35 h, until the Xyl was consumed. A sample of 1.5 ml was taken hourly, filtered through a 0.20 µm cellulose acetate filter (Labsolute USA, 7699822), and stored at −20°C for off-line analysis with HPLC.

| Off-line analysis with HPLC
Glu, Xyl, EtOH, furfural (Fur), 5-hydroxymethyl furfural (5-HMF), and acetic acid (HAc) were measured off-line using an Ultimate 3000 HPLC (Thermo Fisher Scientific) with an Aminex HPX-87H column (Bio-Rad) at 50°C with 5 mM H2SO4 as eluent and a flow rate of 0.6 ml/min for 80 min. A sample volume of 950 µl was diluted with 50 µl of 5 M H2SO4 before injection. All compounds were detected using the refractive index (ERC RefractoMax 520; Prague) at 50°C.
In the kinetic model, $\hat{x}(t)$ is a vector containing the state variables of the model, $u(t)$ are the external inputs, and $p$ are the model parameters. The data-driven model consists of a set of three independent PLS models that take the spectra collected on-line with the ATR-MIR spectrometer as input and return the measured concentrations of Glu, Xyl, and EtOH as output. At time zero ($t_0$), the mechanistic model is used to generate a long-horizon prediction of the profile of the fermentation from the initial conditions ($x_0$). Then, during the fermentation, the CD-EKF operates iteratively in two steps to predict and update the estimations of the concentrations of Glu, Xyl, and EtOH. In the first step, at each time $t_k$, given the previous a posteriori estimate of the state of the system $\hat{x}_{k|k}$, the kinetic model is integrated numerically from $t_k$ to $t_{k+1}$ to produce a priori estimates of the states at $t_{k+1}$ ($\hat{x}_{k+1|k}$).
The process noise is assumed to be independent and normally distributed. During the same step, a priori estimates of the covariance matrix ($P_{x,k+1|k}$) are also calculated by propagating it forward from $t_k$ to $t_{k+1}$ using Equation (3) and taking $P_{x,k|k}$ as the initial condition (Zhou et al., 2012).
Equation (3) uses the Jacobian matrix of the kinetic model. The measurements relate to the state of the system at $t_{k+1}$ through the function $h(\hat{x}_{k+1})$, which maps the states at $t_{k+1}$ to the measurement vector $y_{k+1}$, plus the noise associated with the measurements. At $t_{k+1}$, the new measurements of the concentration of Glu, Xyl, and EtOH are used to update the a priori estimates of the state variables and their covariance matrix to produce the a posteriori estimates $\hat{x}_{k+1|k+1}$. This is the update step in the CD-EKF algorithm.
First, the Kalman gain ($K_{k+1}$) is calculated using Equation (5), where $C$ is the Jacobian matrix of $h(\hat{x}_{k+1})$. Finally, a posteriori estimates of the states and the covariance matrix are calculated using the update equations (Equations 6 and 7, respectively), where $I$ is the identity matrix. At the end of each iteration, the a posteriori covariance matrix ($P_{x,k+1|k+1}$) is used to calculate the 95% confidence interval of the a posteriori state estimates.
Before building the hybrid model, it is necessary to develop, identify, and calibrate the kinetic and the PLS models. The kinetic model was identified off-line using data from fermentation 1 (Table 1), and the PLS models were calibrated using semi-synthetic samples.
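The prediction and update steps of the CD-EKF described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the two-state toy model (substrate and ethanol), its parameter values, the symbols `Q` and `R` for the process and measurement noise covariances, the fixed-step Runge-Kutta integrator, and the linear measurement function $h(x) = Cx$ are all hypothetical. The covariance is propagated with the standard continuous-time relation $\dot{P} = FP + PF^{T} + Q$, and the update uses the standard gain $K = PC^{T}(CPC^{T} + R)^{-1}$.

```python
import numpy as np

# Hypothetical two-state toy model (substrate S, ethanol E); NOT the
# 8-state kinetic model of this study. Parameter values are assumed.
V_MAX, K_S, Y_ES = 2.0, 0.5, 0.45   # g/L/h, g/L, g/g

def f(x):
    s = max(x[0], 0.0)
    r = V_MAX * s / (K_S + s)            # Monod uptake rate
    return np.array([-r, Y_ES * r])

def jacobian_f(x):
    s = max(x[0], 0.0)
    dr = V_MAX * K_S / (K_S + s) ** 2    # d(rate)/dS
    return np.array([[-dr, 0.0], [Y_ES * dr, 0.0]])

def predict(x, P, Q, dt, n_sub=20):
    """Prediction step: integrate state and covariance over dt with RK4."""
    def deriv(x, P):
        F = jacobian_f(x)
        return f(x), F @ P + P @ F.T + Q
    h = dt / n_sub
    for _ in range(n_sub):
        k1x, k1P = deriv(x, P)
        k2x, k2P = deriv(x + 0.5 * h * k1x, P + 0.5 * h * k1P)
        k3x, k3P = deriv(x + 0.5 * h * k2x, P + 0.5 * h * k2P)
        k4x, k4P = deriv(x + h * k3x, P + h * k3P)
        x = x + h / 6.0 * (k1x + 2 * k2x + 2 * k3x + k4x)
        P = P + h / 6.0 * (k1P + 2 * k2P + 2 * k3P + k4P)
    return x, P

def update(x_prior, P_prior, y, C, R):
    """Update step: Kalman gain, a posteriori state and covariance."""
    K = P_prior @ C.T @ np.linalg.inv(C @ P_prior @ C.T + R)
    x_post = x_prior + K @ (y - C @ x_prior)
    P_post = (np.eye(x_prior.size) - K @ C) @ P_prior
    return x_post, P_post

rng = np.random.default_rng(1)
dt, n_steps = 0.25, 40                    # 15 min between measurements
C, Q, R = np.eye(2), 1e-4 * np.eye(2), 1.0 * np.eye(2)
zero = np.zeros((2, 2))

x_true = np.array([20.0, 0.0])
x_est, P_est = x_true.copy(), 0.1 * np.eye(2)
err_meas, err_filt = [], []
for _ in range(n_steps):
    x_true, _unused = predict(x_true, zero, zero, dt)   # simulated "plant"
    y = C @ x_true + rng.normal(0.0, 1.0, size=2)       # noisy measurement
    x_est, P_est = predict(x_est, P_est, Q, dt)
    x_est, P_est = update(x_est, P_est, y, C, R)
    ci_half_width = 1.96 * np.sqrt(np.diag(P_est))      # 95% interval
    err_meas.append(y - x_true)
    err_filt.append(x_est - x_true)

rmse_meas = np.sqrt(np.mean(np.square(err_meas)))
rmse_filt = np.sqrt(np.mean(np.square(err_filt)))
print(f"RMSE measurements: {rmse_meas:.2f}, RMSE filter: {rmse_filt:.2f}")
```

In this sketch the filter model matches the simulated process exactly, which is the most favorable case; with model mismatch, the relative sizes of `Q` and `R` determine how strongly the estimates follow the measurements.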

| Kinetic model
The kinetic model describes the dynamics of Glu, Xyl, Fur, furfuryl alcohol (FA), 5-HMF, HAc, EtOH, and biomass. The kinetic model was developed following the notation described by Sin et al. (2008). The model was based on the uptake rates of Glu, Xyl, Fur, 5-HMF, and HAc, and the reaction rates of FA, EtOH, and biomass were expressed as a linear combination of the uptake rates of Fur, Glu, and Xyl using the stoichiometric matrix (shown in Table 2). The reaction rates for each state variable were modeled using the following expressions:
• The uptake rate of the substrates (Glu and Xyl) followed Monod kinetics with substrate inhibition (Equation 9 in Table 3).
• The inhibition effects by Fur, FA, 5-HMF, and HAc were modeled by multiplying the reaction rates by the inhibition term shown in Equation 11 in Table 3.
• Product inhibition of the uptake rates of Glu and Xyl was modeled by multiplying the respective uptake rates by the empirical term shown in Equation 12 in Table 3.
• Competitive inhibition (Glu inhibits the uptake of Xyl) was modeled using the inhibitory term shown in Equation 13 in Table 3.
By combining the stoichiometric matrix with the uptake kinetic rates, a model with 8 ordinary differential equations and 32 parameters was obtained (the full set of differential equations and the list of parameters are shown in the Supporting Information Material).
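The structure of these rate expressions can be illustrated with the short Python sketch below. The functional forms are common textbook choices that are assumed here for illustration (Haldane-type substrate inhibition, a hyperbolic inhibition factor, and a linear product-inhibition factor); the actual expressions are those of Table 3. All parameter names and values are hypothetical, and a single inhibition constant per inhibitor is used for both sugars to keep the example short.

```python
def monod_substrate_inhibition(s, v_max, k_s, k_si):
    """Monod uptake with substrate inhibition (assumed Haldane form)."""
    return v_max * s / (k_s + s + s ** 2 / k_si)

def inhibition_term(c_inh, k_inh):
    """Multiplicative inhibition factor in (0, 1] (assumed hyperbolic form)."""
    return 1.0 / (1.0 + c_inh / k_inh)

def product_inhibition(p, p_max):
    """Empirical product-inhibition factor (assumed linear form)."""
    return max(0.0, 1.0 - p / p_max)

def uptake_rates(c, par):
    """Return (q_glu, q_xyl) for concentrations c, both in g/L."""
    inhib = 1.0
    for name in ("Fur", "FA", "HMF", "HAc"):
        inhib *= inhibition_term(c[name], par["K_inh_" + name])
    prod = product_inhibition(c["EtOH"], par["EtOH_max"])
    q_glu = monod_substrate_inhibition(
        c["Glu"], par["v_max_Glu"], par["K_s_Glu"], par["K_si_Glu"]) * inhib * prod
    # Glucose additionally inhibits xylose uptake (competitive inhibition).
    q_xyl = monod_substrate_inhibition(
        c["Xyl"], par["v_max_Xyl"], par["K_s_Xyl"], par["K_si_Xyl"]
    ) * inhib * prod * inhibition_term(c["Glu"], par["K_i_Glu_Xyl"])
    return q_glu, q_xyl

# Hypothetical parameter values and concentrations, for illustration only.
params = {
    "v_max_Glu": 2.0, "K_s_Glu": 0.5, "K_si_Glu": 300.0,
    "v_max_Xyl": 0.5, "K_s_Xyl": 5.0, "K_si_Xyl": 300.0,
    "K_inh_Fur": 0.5, "K_inh_FA": 5.0, "K_inh_HMF": 2.0, "K_inh_HAc": 5.0,
    "EtOH_max": 100.0, "K_i_Glu_Xyl": 1.0,
    "Y_EtOH_Glu": 0.45, "Y_EtOH_Xyl": 0.40,
}
conc = {"Glu": 30.0, "Xyl": 20.0, "Fur": 0.5, "FA": 0.0,
        "HMF": 0.2, "HAc": 3.0, "EtOH": 10.0}

q_glu, q_xyl = uptake_rates(conc, params)
_, q_xyl_no_glu = uptake_rates(dict(conc, Glu=0.0), params)
# EtOH production as a linear combination of the two uptake rates,
# mimicking one row of a stoichiometric matrix.
r_etoh = params["Y_EtOH_Glu"] * q_glu + params["Y_EtOH_Xyl"] * q_xyl
print(q_glu, q_xyl, q_xyl_no_glu, r_etoh)
```

The comparison between `q_xyl` and `q_xyl_no_glu` shows the intended effect of the competitive term: xylose uptake is suppressed while glucose is present and recovers once glucose is depleted.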
FIGURE 3 Conceptualization of the kinetic model. (1) Glucose uptake, (2) xylose uptake, (3) furfural uptake, (4) furfural is converted into furfuryl alcohol (FA), (5) FA inhibits the uptake of glucose, (6) FA inhibits the uptake of xylose, (7) furfural inhibits the uptake of glucose, (8) furfural inhibits the uptake of xylose, (9) furfural inhibits the uptake of 5-hydroxymethyl furfural (5-HMF), (10) 5-HMF inhibits the uptake of glucose, (11) 5-HMF inhibits the uptake of xylose, (12) 5-HMF uptake, (13) 5-HMF is converted into acetate, (14) acetic acid uptake, (15) acetic acid inhibits the uptake of glucose, (16) acetic acid inhibits the uptake of xylose.
The parameters were estimated by fitting the model to off-line experimental data obtained with HPLC. The profiles of Glu, Xyl, Fur, and EtOH of fermentation 1 (Table 1) were used for the parameter estimation. Previously reported parameter values (2014) were used as the initial guess for the parameter estimation (see Supporting Information Material). To improve the fit, a local sensitivity and identifiability analysis was conducted to find the parameters with the largest impact on the output and the combinations of parameters that are not linearly correlated (Brun & Reichert, 2001). As a result of the sensitivity and identifiability analysis, a subset of three parameters ($v_{max,Glu}$, $K_{i,HAc,Glu}$, and $K_{i,Glu,Xyl}$) was selected for further estimation (the results of the sensitivity and identifiability analyses are shown in Supporting Information Material). A second parameter estimation was performed with the three selected parameters using the bootstrap framework described by Sin and Gernaey (2016). First, a reference parameter estimation was done using nonlinear least squares. Then, 100 synthetic data sets were created by randomly sampling from the residuals of the reference parameter estimation using the Monte Carlo method (Sin & Gernaey, 2016).
The three parameters were then re-estimated for each of the 100 synthetic data sets using nonlinear least squares.
This process resulted in a population of 100 estimates for each parameter. The mean, the SD, and the covariance matrix were calculated for each of the estimated parameters to assess their uncertainty. The uncertainty in the estimated parameters was propagated through the model using a Monte Carlo approach to assess the uncertainty in the model output (Sin & Gernaey, 2016).
The uncertainty in the parameters estimated using the bootstrapping method was considered to follow a normal distribution. The model was finally validated with the fermentations 2-4 (validation results are shown in Supporting Information Material IV).
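The bootstrap procedure described above can be sketched as follows. The example uses a hypothetical two-parameter exponential model and simulated data in place of the kinetic model and the HPLC profiles, and it relies on `scipy.optimize.least_squares` for the nonlinear least-squares fits; these choices, the noise level, and the number of Monte Carlo samples are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(2)

def model(theta, t):
    """Hypothetical stand-in model: exponential substrate decay."""
    a, b = theta
    return a * np.exp(-b * t)

# Simulated "experimental" data (assumed true parameters and noise level).
t = np.linspace(0.0, 10.0, 25)
y_obs = model(np.array([20.0, 0.3]), t) + rng.normal(0.0, 0.3, t.size)

def fit(y, theta0):
    """Nonlinear least-squares estimate of the parameters."""
    return least_squares(lambda th: model(th, t) - y, theta0).x

# 1) Reference estimation and its residuals.
theta_ref = fit(y_obs, np.array([10.0, 0.1]))
y_ref = model(theta_ref, t)
residuals = y_obs - y_ref

# 2) Synthetic data sets by resampling the residuals, then re-estimation.
n_boot = 100
estimates = np.empty((n_boot, 2))
for i in range(n_boot):
    y_syn = y_ref + rng.choice(residuals, size=t.size, replace=True)
    estimates[i] = fit(y_syn, theta_ref)

# 3) Summary statistics of the parameter population.
theta_mean = estimates.mean(axis=0)
theta_sd = estimates.std(axis=0, ddof=1)
theta_cov = np.cov(estimates, rowvar=False)

# 4) Monte Carlo propagation to the model output, assuming normality.
samples = rng.multivariate_normal(theta_mean, theta_cov, size=500)
outputs = np.array([model(th, t) for th in samples])
lower, upper = np.percentile(outputs, [2.5, 97.5], axis=0)
print("mean:", theta_mean, "SD:", theta_sd)
```

Sampling the parameters jointly from the estimated covariance matrix, rather than independently, preserves their correlation when the uncertainty is propagated to the model output.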

| Calibration of the data-driven models
Three independent PLS models were developed to calculate the concentrations of Glu, Xyl, and EtOH from the spectral data collected with the ATR-MIR spectrophotometer. Due to the limited medium availability, a specific procedure to calibrate the PLS models was designed to (1) account for the matrix absorbance; (2) distribute the leverage of each calibration sample evenly across the experimental space; and (3) minimize the correlation between the concentrations of Glu, Xyl, and EtOH. First, 1.5 L of wheat straw hydrolysate was fermented (as described in Section 2) to remove the Glu and Xyl from the media (the Glu was entirely removed, but a residual concentration of 0.2 g/L of Xyl remained in the fermentation media). Then, the fermented medium was centrifuged (at 4000 rpm for 5 min) to remove the biomass, and the EtOH was stripped out by sparging sterile air for 24 h at 35°C. Finally, the volume was adjusted to 1.5 L by adding 150 ml of deionized water. The resulting broth was the fermentation matrix without Glu, Xyl, or EtOH, and it was used to prepare 21 semi-synthetic samples for the calibration set. Note that the fermentation matrix used to calibrate the PLS models corresponded to the matrix at the end of the fermentation, and therefore, the PLS models did not take into account the changes in the fermentation matrix. This approach is valid under the assumption that the contribution to the variance of the spectral matrix caused by the change in the concentration of the analytes is much larger than the contribution caused by the change in the matrix. The samples were prepared based on a three-dimensional Latin hypercube (LH) experimental design. This design was chosen to distribute the leverage of the different samples evenly across the experimental space (Montgomery, 2009; the calibration space is shown in Table 4).
To minimize the correlation between the concentrations of Glu, Xyl,
TABLE 4 Calibration space considered for the partial least squares models
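A three-dimensional Latin hypercube design with low correlation between the analyte concentrations can be generated along the following lines. This is a sketch under assumptions: the concentration ranges are placeholders rather than the calibration space of Table 4, and selecting the least-correlated design among random candidates is one simple way to reduce the correlation, not necessarily the procedure followed in this study.

```python
import numpy as np

def latin_hypercube(n_samples, n_dims, rng):
    """Unit-cube design with exactly one point per stratum in each dimension."""
    strata = np.column_stack(
        [rng.permutation(n_samples) for _ in range(n_dims)])
    return (strata + rng.uniform(size=(n_samples, n_dims))) / n_samples

def max_abs_correlation(design):
    """Largest absolute pairwise correlation between the design columns."""
    corr = np.corrcoef(design, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return np.abs(off_diag).max()

rng = np.random.default_rng(3)
n_samples, n_dims = 21, 3          # 21 samples; Glu, Xyl, EtOH

# Draw many candidate designs and keep the least correlated one.
candidates = [latin_hypercube(n_samples, n_dims, rng) for _ in range(200)]
unit_design = min(candidates, key=max_abs_correlation)

# Hypothetical concentration ranges in g/L (placeholders, not Table 4).
lower = np.array([0.0, 0.0, 0.0])
upper = np.array([50.0, 30.0, 40.0])
design = lower + unit_design * (upper - lower)

print("max |correlation|:", round(max_abs_correlation(design), 3))
```

Each row of `design` would correspond to the Glu, Xyl, and EtOH concentrations to spike into one aliquot of the analyte-free fermentation matrix.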

| Tuning the CD-EKF
To initialize the CD-EKF, initial estimates of the measurement and process noise covariance matrices must be provided. This  (Table 1).

| RESULTS AND DISCUSSION
The performances of two real-time monitoring strategies were compared using three different cellulose-to-EtOH fermentations.
with the prediction of Xyl. This interference was arguably due to the accumulation of glycerol and biomass in the fermentation media during the growth on Glu. S. cerevisiae produces glycerol to regenerate NAD+/NADH and to maintain the redox balance within the cells (Palmqvist et al., 1999). This was further supported by the off-line measurements with HPLC, which showed that during the Glu consumption phase, glycerol reached a concentration of 3 g/L (data not shown). Moreover, during the Xyl consumption phase, the glycerol concentration did not significantly change, and glycerol remained in the fermentation matrix after stripping the EtOH. Consequently, the matrix used to calibrate the PLS models contained a high glycerol concentration and a low biomass concentration, which did not represent the properties of the matrix at the beginning of the fermentation. The second trend in the prediction of Xyl occurred during the Xyl consumption phase, where the glycerol concentration remained constant. In this second phase, the PLS model was able to describe the trend of Xyl in all fermentations, with a slight overestimation in fermentations 2 and 4 (Figures 5a2, 5b2, and 5c2).
EtOH was accurately predicted in fermentation 3 (Figure 5b3), but it showed significant deviations during the Glu consumption phase in fermentations 2 and 4 (Figures 5a3 and 5c3). These deviations were likely also caused by matrix effects. Nonetheless, during the Xyl consumption phase, the concentration of EtOH was accurately predicted in all fermentations (Figures 5a3, 5b3, and 5c3). The PLS models were able to successfully describe the general trajectories of Glu, Xyl, and EtOH in three fermentations with different initial conditions. These predictions can be used to assess the progress of the fermentation and demonstrate the robustness of the procedure followed to calibrate the models. However, the matrix effects had a
CABANEROS LOPEZ ET AL.
due to the clogging of the recirculation loop (Figures 5c1-3 and 5d3).
As such, the proposed hybrid model is robust to transient deviations in the measuring signal, whereas the PLS model is not. This stability is desirable when dealing with spectroscopic data, where the signal is highly sensitive to disturbances such as air bubbles or solid compounds.

| Perspectives for industrial application of the hybrid approach
In the current implementation, the hybrid model was updated with new measurements every 15 min to produce new estimates of the system state. This updating frequency allows monitoring of the progress of the fermentation in real-time, and can be used to detect deviations in the fermentation profile (e.g., due to contamination by lactic acid bacteria) and to take corrective actions. However, given the dynamics of the system, other control applications (such as feed rate control) would require higher updating frequencies. The data acquisition and the computational time to solve the hybrid model are the two factors limiting the updating frequency. One of the main advantages of using spectroscopic methods over other monitoring tools such as at-line HPLC is that spectroscopy allows the fast and automated collection and analysis of new spectra, resulting in a high updating frequency of the state variables without the need for manual sampling (Cabaneros et al., 2019). The spectrophotometer used in this study can collect a new spectrum every minute, and the hybrid model is solved in a few seconds, updating the states of the system every 1.5 min. The possibility to reach high updating frequencies makes this approach also useful for the implementation of control schemes that require faster response times (e.g., for feed-rate control in fed-batch operations).
In the present work, the media was centrifuged before the fermentation to remove suspended solid compounds that could clog the tubing in the recirculation loop. However, in industrial operations, the media is not centrifuged, and high concentrations of suspended solids are present during the fermentation process. This situation would be further aggravated in processes using simultaneous saccharification and fermentation, where the concentration of solid matter is especially high. Such high concentrations of suspended solid compounds can interfere with the collected spectra, reducing the accuracy of the predictions and even limiting the industrial applicability of some spectroscopic methods (Cabaneros et al., 2019). In this context, ATR-MIR spectroscopy is an attractive option because the shallow penetration depth of the light into the media makes it robust to high concentrations of suspended solids. As such, ATR-MIR spectroscopy has been successfully applied to monitor on-line the progress of various systems with high solid matter, such as the mashing process in breweries (Patent No. WO 2015/155353, 2015). Although ATR-MIR spectroscopy can be used to monitor processes with a high concentration of suspended solids, it would still be expected that the interference of the light with the solid matter affects the precision and accuracy of the measurements. On a practical note, even though on-line probes with a recirculation loop have been successfully implemented in systems with high concentrations of suspended solids, in-line probes are advantageous to avoid clogs in the recirculation loop (Cabaneros et al., 2019).

| CONCLUSIONS
Real-time monitoring of cellulose-to-EtOH fermentations is challenging due to the high complexity of the fermentation media derived from lignocellulosic material. The results of this study showed that relying only on advanced spectroscopic measurements combined with PLS regression models to measure the concentration of Glu, Xyl, and EtOH can yield a good qualitative description of the fermentation progress. However, the interference of other compounds such as glycerol or biomass and the presence of bubbles and suspended solids in the fermentation broth results in noisy and biased predictions that limit the implementation of advanced control schemes. The hybrid approach presented in this study efficiently fuses the predictions of the PLS model and the internal model of the system to correct the inconsistencies of the PLS predictions and produce consistent estimates of the state variables. Having a thorough understanding of the behavior of the measuring system is crucial to tune a robust and stable CD-EKF effectively. The hybrid model presented in this study was calibrated and tuned using only two fermentations, and large amounts of data were not required to develop the state estimator. This is an essential feature as data is often not easily available in industrial setups. The quality of the predicted concentrations of Glu, Xyl, and EtOH using the hybrid model opens the doors towards the implementation of advanced monitoring schemes.
In the current configuration, the hybrid model can be used to monitor fermentations in real-time, to detect deviations in the behavior, and to take corrective actions when needed. Moreover, the high sampling frequency (one sample per minute) and the low computational time required by the model allow updating the estimates every 1.5 min, making this approach suitable to implement real-time control strategies.

ACKNOWLEDGMENTS
The authors wish to thank Prof. Carl Johan Franzén from the Chal-

DATA AVAILABILITY STATEMENT
Data available on request from the authors.