Deep hybrid modeling of a HEK293 process: Combining long short‐term memory networks with first principles equations

The combination of physical equations with deep learning is becoming a promising methodology for bioprocess digitalization. In this paper, we investigate for the first time the combination of long short‐term memory (LSTM) networks with first principles equations in a hybrid workflow to describe human embryonic kidney 293 (HEK293) culture dynamics. Experimental data of 27 extracellular state variables in 20 fed‐batch HEK293 cultures were collected in a parallel high‐throughput 250 mL cultivation system in an industrial process development setting. The adaptive moment estimation method with stochastic regularization and cross‐validation were employed for deep learning. A total of 784 hybrid models with varying deep neural network architectures, depths, layer sizes and node activation functions were compared. In most scenarios, hybrid LSTM models outperformed classical hybrid feedforward neural network (FFNN) models in terms of training and testing error. Hybrid LSTM models also proved to be less sensitive to data resampling than hybrid FFNN models. As disadvantages, hybrid LSTM models are in general more complex (higher number of parameters) and have a higher computation cost than hybrid FFNN models. The hybrid model with the highest prediction accuracy consisted of an LSTM network with seven internal states connected in series with dynamic material balance equations. This hybrid model correctly predicted the dynamics of the 27 state variables (R2 = 0.93 in the test data set), including biomass, key substrates, amino acids and metabolic by‐products, for around 10 cultivation days.

Human embryonic kidney 293 (HEK293) is a well-known immortalized mammalian cell line of human origin. A major advantage in relation to other mammalian cell lines, such as Chinese Hamster Ovary (CHO) cells, is the ability to express proteins with human post-translational modifications. Research on these cells has taken major strides as scale-up methods have been developed to make protein expression economically competitive (Dumont et al., 2016; Portolano et al., 2014; Swiech et al., 2012). Over the last few years, HEK293 cells were used in the manufacturing of several vaccine products and candidates against SARS-CoV-2 in response to the COVID-19 pandemic (Ren et al., 2020; Sanchez-Felipe et al., 2021; van Doremalen et al., 2020).
Given the ubiquity of the HEK293 cell line in academia and industry, the development of reliable dynamic models to support cell line and process digitalization is of critical importance. Bioprocess Digital Twins (DT) based on high-fidelity mathematical models are currently under development at many industrial sites (Udugama et al., 2021). Mechanistic modeling is a traditional approach to develop DTs (e.g., Hartmann et al., 2022; Monteiro et al., 2023).
However, mechanistic models of HEK293 cells combining cell growth kinetics and metabolism are still scarce. Nguyen and co-authors developed a mechanistic kinetic model of recombinant adeno-associated virus production in HEK293 cells by triple transfection (Nguyen et al., 2021). This model covers the main kinetic steps, starting from exogenous DNA delivery to the reaction cascade that forms viral proteins and DNA. Joiner and co-authors have reviewed modeling approaches for recombinant adeno-associated virus production in HEK293 cells (Joiner et al., 2022). Helgers and co-authors developed a dynamic model to describe cell growth, metabolism (glucose, lactate, ammonium and amino acids) and HIV virus-like particle production in fed-batch cultivations (Helgers et al., 2022).
With the emergence of machine learning and deep learning in recent years, bioprocess digitalization approaches that rely on this type of model are gaining popularity (Helleckes et al., 2023; Mowbray et al., 2021; Mowbray et al., 2022). Machine learning relies, however, on big data resources that are still lagging in the bioprocess industries (Udugama et al., 2021). A promising approach is to combine machine learning with mechanistic knowledge in hybrid modeling workflows. The combination of both approaches may increase the predictive power of models, improve model transparency and decrease the data dependency in relation to purely data-driven models (Agharafeie et al., 2023; Cuperlovic-Culf et al., 2023; Helleckes et al., 2022; Mukherjee & Bhattacharyya, 2023; von Stosch et al., 2014). Fu and Barford applied the hybrid modeling approach to a hybridoma cell line 6BB expressing a monoclonal antibody (Fu & Barford, 1996). The hybrid model consisted of bioreactor material balance equations (a system of ordinary differential equations (ODEs)) coupled with a feedforward neural network (FFNN) to predict the kinetics of substrate consumption, byproduct accumulation, cell growth, product formation as well as cell composition. Teixeira and co-authors applied a similar serial hybrid modeling strategy for BHK-21 cells expressing a fusion glycoprotein IgG1-IL2 (Teixeira et al., 2005; Teixeira et al., 2007). Aehle and co-authors developed a serial hybrid model (FFNN + material balances) for on-line estimation of viable cell concentration in fed-batch CHO cultures (Aehle et al., 2010). Narayanan and co-authors developed a serial hybrid model (FFNN connected to material balances) for a CHO fed-batch process (81 batches in a 3.5 L bioreactor) (Narayanan et al., 2019). Bayer and co-authors applied the same serial hybrid modeling strategy to analyze data obtained by intensified design of experiments to reduce the validation burden of CHO cultures in a biopharma quality-by-design context (Bayer et al., 2021).
The neural networks research field is rapidly evolving towards complex multilayered data representation architectures (Alzubaidi et al., 2021). Multilayered (deep) FFNNs were proven to better approximate nonlinear functions than three-layer (shallow) FFNNs (e.g., Liang & Srikant, 2016). With a significant delay, hybrid modeling is taking its first steps towards the integration of deep learning into its framework. The training of deep neural networks (DNNs) combined with systems of ODEs may be challenging due to the high central processing unit (CPU) cost associated with the computation of gradients by the sensitivity method. Deep FFNNs coupled with systems of ODEs have been recently investigated by Pinto and co-authors for bioreactor hybrid modeling (Pinto et al., 2022). Deep learning based on the adaptive moment estimation method (ADAM) (Kingma & Ba, 2014) and modified semi-direct sensitivity equations was shown to systematically improve the predictive power of deep hybrid models over their shallow counterparts (Pinto et al., 2022).
A promising network architecture for bioprocess dynamic modeling is the LSTM network originally proposed by Hochreiter & Schmidhuber, 1997. LSTMs are a particular type of recurrent neural network (RNN) that uses several gated units (multilayer cell structure) and a cell state layer, with the ability to approximate complex dynamics (Smagulova & James, 2019). Hansen and co-authors recently applied LSTMs for modeling phosphorous removal in a wastewater treatment plant characterized by complex mixed microbial dynamics (Hansen et al., 2022). Cheng and co-authors reported a multilayered hybrid modeling workflow for a wastewater treatment process that combined a mechanistic model, a convolutional neural network (CNN), an LSTM and a FFNN (Cheng et al., 2021). In this paper, we extend the deep hybrid modeling framework proposed by Pinto et al. (2022) to LSTMs combined in series with dynamic material balance equations. Deep hybrid models based on LSTMs were systematically compared with classical hybrid models based on FFNNs. The proposed hybrid modeling framework was applied to describe HEK293 fed-batch culture dynamics. Experimental data collected in a parallel bioreactor system were used to train the hybrid models. To the best of our knowledge, this is the first study comparing hybrid LSTM structures with the classical hybrid FFNN structures for bioprocess modeling.

| Cell culture and analytics
A GSK proprietary HEK293 cell line and chemically defined culture medium were used for the cell expansion. Precultures of the cell line were grown in shake-flasks (500 mL) at 37°C in a 5% CO2 atmosphere with a shaking frequency of 160 rpm (140 mL of culture).
GSK aims at a targeted process scheme that consists of a growth phase (6-7 days) and an adenovirus production phase (3-4 days).
The goal of the current set of cultures was to optimize cell growth only. Although cultures were maintained up to 10 days, no viral infection was performed. In total, 20 cell cultures were carried out in 250 mL vessels (Ambr systems), with an initial seed of 0.4 Mcell/mL. The pH was controlled at 7.2 with NaHCO3 7.5% and sparged CO2 together with overlay aeration. The dissolved oxygen (DO) was controlled at 40% by sparging pure oxygen. Stirring was adjusted to around 20 W/m3. The cultivations were initiated in batch mode and switched to fed-batch mode after 48 h. The basal medium was the same in all cultivations, but the feeding solutions (11 unique solutions) applied changed from one culture to another. The different feeding solutions consisted of mixtures of amino acids, glucose (Glc), glutamine (Gln), pyruvate (Pyr), vitamins and other micronutrients such as selenium and magnesium dichloride (MgCl2).
The feeding mixtures were designed using statistical design of experiments (DoE) so as to "excite" the biological system with very diverse process conditions and to collect an information-rich data set.
Feedings were carried out once a day as a quasi-simultaneous addition of a bolus of all the feeding solutions involved. The daily feed volume was constant at 8 mL, with an additional glucose feed provided in case its concentration was expected to deplete in the following day (based on the measured glucose concentration and a cell-specific consumption rate).
Sampling was performed daily, where the viable cell density (VCD) and viability were measured using a Vi-Cell cell counter (Beckman, Indianapolis, USA). Samples were also assayed for Glc, lactate (Lac), Pyr, Gln, ammonium (NH4) and lactate dehydrogenase (LDH) with a CedexBio-HT metabolite analyzer (Roche). The remaining metabolites and amino acids were assayed off-line by Nuclear Magnetic Resonance spectroscopy (NMR) at Eurofins Spinnovation (Oss, The Netherlands). According to previous calibrations, the measurement error is approximately 10% for VCD and 20% for the metabolites.

| Reaction correlation matrix
The reaction correlation matrix, S, was inferred from cell culture data using the state-space reduction method proposed by Pinto and co-authors (Pinto et al., 2023c). Briefly, the data of concentrations, feeding and sampling volumes were used to estimate the cumulative reacted amount over time of each species in the 20 fed-batch experiments (described in Supporting Information S1). The cumulative reacted amounts were organized in a two-dimensional data matrix, IR. The rows represent the process time points of every fed-batch experiment stacked vertically. The columns represent the 27 measured bioreaction compounds (viable cell count, Xv; glucose, Glc; lactate, Lac; glutamine, Gln; pyruvate, Pyr; glutamate, Glu; ammonium, NH4; alanine, Ala; arginine, Arg; asparagine, Asn; aspartate, Asp; citrate, Cit; cysteine, Cys; formic acid, For; glycerol, Glyc; histidine, His; isoleucine, Ile; leucine, Leu; lysine, Lys; methionine, Met; phenylalanine, Phe; proline, Pro; serine, Ser; threonine, Thr; tryptophan, Trp; tyrosine, Tyr; and valine, Val). A normalized matrix, $\tilde{IR} = IR \oslash IR_{max}$, was obtained by dividing each IR column by the respective maximum absolute value, $IR_{max}$, with $\oslash$ the Hadamard (element-wise) division. In the next step, principal component analysis (PCA) was applied to decompose $\tilde{IR}$ into a matrix of scores, Sco, and a matrix of coefficients, Coeff.
The alternating least-squares PCA algorithm was adopted in MATLAB® for this purpose (function "pca" with the alternating least-squares (ALS) option). Finally, the reaction correlation matrix was obtained by denormalization of the matrix of coefficients, $S = Coeff \otimes IR_{max}$, with $\otimes$ the Hadamard (element-wise) multiplication. The matrix S contains correlation information between the consumption and/or production of the bioreaction compounds.
Inclusion of the reaction correlation matrix significantly improves the model parsimony and predictive power (preliminary study in Supporting Information S1).
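The normalization, PCA decomposition and denormalization steps above can be sketched in a few lines of NumPy. This is an illustrative reconstruction only: the authors used MATLAB's ALS-PCA (which also handles missing entries), whereas this sketch uses plain SVD-based PCA on a complete data matrix, and the function and variable names are ours.

```python
import numpy as np

def reaction_correlation_matrix(IR, n_components=7):
    """Sketch of the state-space reduction step (hypothetical implementation).

    IR : (n_timepoints, n_species) matrix of cumulative reacted amounts.
    Returns S, the (n_species, n_components) reaction correlation matrix.
    """
    IR_max = np.max(np.abs(IR), axis=0)           # column-wise normalizers, IR_max
    IR_tilde = IR / IR_max                        # Hadamard division, IR ./ IR_max
    IR_centered = IR_tilde - IR_tilde.mean(axis=0)
    # PCA via SVD: rows of Vt are the principal-component coefficient vectors
    U, sigma, Vt = np.linalg.svd(IR_centered, full_matrices=False)
    coeff = Vt[:n_components].T                   # (n_species, n_components) loadings
    S = coeff * IR_max[:, None]                   # de-normalize (Hadamard product)
    return S
```

With 27 measured species and 7 retained principal components, the returned S has the (27 × 7) shape used in the material balance equations.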

| Deep hybrid modeling method
The models applied to the HEK293 fed-batch cultures are based on the deep hybrid modeling framework proposed by Pinto and co-authors (Pinto et al., 2022). It consists of a deep neural network connected in series with a system of ordinary differential equations (ODEs) (Figure 1). The ODEs were derived from material balance equations of the 27 biochemical species contained in a perfectly mixed bioreactor compartment, taking the following general form:

$$\frac{dC}{dt} = S \cdot r \cdot X_v - D \cdot C + D \cdot C_{in} \qquad (4)$$

with C a vector of 27 concentrations, S a (27 × 7) reaction correlation matrix, r a (7 × 1) vector of specific reaction rates, $X_v$ the viable cell concentration, D the dilution rate and $C_{in}$ a (27 × 1) vector of concentrations in the feed stream. Due to the discrete-time implementation of the LSTM (described below), Equation 4 was implemented in discrete time as follows,

$$C(t+\delta t) = C(t) + \left( S \cdot r(t) \cdot X_v(t) - D(t) \cdot C(t) + D(t) \cdot C_{in}(t) \right) \delta t \qquad (5)$$

with $\delta t$ the discretization time step. The material balance Equation 5 was abbreviated as ODE(27) in the hybrid model specification.
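The discrete-time material balance of Equation 5 is a forward-Euler update, which can be sketched as follows (illustrative only; the function and argument names are ours, not the authors' code):

```python
import numpy as np

def euler_step(C, r, Xv, D, C_in, S, dt):
    """One discrete-time material-balance update (Equation 5, Euler scheme):
    C(t+dt) = C(t) + (S @ r * Xv - D*C + D*C_in) * dt

    C, C_in : (27,) concentration vectors
    S       : (27, 7) reaction correlation matrix
    r       : (7,) specific reaction rates (DNN output)
    Xv, D   : viable cell concentration and dilution rate (scalars)
    """
    dCdt = S @ r * Xv - D * C + D * C_in   # reaction, dilution and feed terms
    return C + dCdt * dt
```

In the hybrid model, r is produced at each time step by the neural network and passed to this update, which propagates the 27 concentrations forward in time.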
The specific kinetic rates, r(t), were modeled by a DNN as a function of cultivation time and the concentrations of species, C(t).
DNNs were implemented as a sequence of nh + 2 interconnected layers, starting with an input layer, followed by nh hidden layers and a final output layer. The input layer, abbreviated as In(27), always had 27 inputs corresponding to the concentrations, C(t), normalized by their maximum values, $C_{max}$. A feedforward layer at hidden position k = 1, …, nh takes the following general form,

$$y^{\{k\}} = f\left(w^{\{k\}}\, y^{\{k-1\}} + b^{\{k\}}\right)$$

with $y^{\{k\}}$ denoting the output vector of hidden layer k, f the hidden layer activation function and $\{w^{\{k\}}, b^{\{k\}}\}$ the layer weights. A peephole LSTM hidden layer (Gers et al., 2000; Hochreiter & Schmidhuber, 1997) was also implemented as an alternative to feedforward layers. A peephole LSTM layer at hidden position k has inputs x(t) and, additionally, the internal state vector z(t). The peephole LSTM layer has an internal structure consisting of four interconnected gate layers, namely a forget gate layer, an input gate layer, an output gate layer and a cell state layer,

$$h_f(t) = \sigma\left(w_f\, x(t) + u_f\, z(t-1) + b_f\right)$$
$$h_i(t) = \sigma\left(w_i\, x(t) + u_i\, z(t-1) + b_i\right)$$
$$h_o(t) = \sigma\left(w_o\, x(t) + u_o\, z(t-1) + b_o\right)$$
$$h_z(t) = \tanh\left(w_z\, x(t) + b_z\right)$$

with $h_f$, $h_i$, $h_o$ and $h_z$ denoting the output vectors of the respective gate layers with size dim(y), and $\{w_f, u_f, b_f, w_i, u_i, b_i, w_o, u_o, b_o, w_z, b_z\}$ the network weights that need to be optimized during training. The internal state, z(t), has a dynamic update rule defined as,

$$z(t) = h_f(t) \odot z(t-1) + h_i(t) \odot h_z(t)$$

The LSTM outputs are finally calculated as follows,

$$y(t) = h_o(t) \odot \tanh\left(z(t)\right)$$

with $\odot$ the Hadamard (element-wise) product. A peephole LSTM layer at hidden position k with ny outputs is abbreviated as LSTM(ny), with its inputs given by the outputs of the preceding layer. The DNN structure is finalized with the output layer corresponding to the specific reaction rates (always 7), abbreviated as Rate(7). Since there is no generic way to choose the optimal size of a DNN, several configurations (784 hybrid model configurations) were evaluated with different sequences of layers (either feedforward or peephole LSTM), varying depth and varying numbers of nodes in the hidden layers. When all hidden layers are of the feedforward type, the resulting model is classified as a FFNN hybrid model. When at least one of the hidden layers is a peephole LSTM, the resulting model is classified as a LSTM hybrid model.
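One time step of the peephole LSTM layer described above can be written directly from the gate equations. The sketch below is a plain NumPy rendering for illustration, assuming a sigmoid gate activation (the standard choice, not stated explicitly in the text); the parameter dictionary keys follow the weight names used in the equations.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def peephole_lstm_step(x, z_prev, p):
    """One time step of the peephole LSTM layer (illustrative sketch).

    x      : (nx,) layer input at time t
    z_prev : (nz,) internal state z(t-1)
    p      : dict of weights wf, uf, bf, wi, ui, bi, wo, uo, bo, wz, bz
    Returns (y, z): layer output y(t) and updated internal state z(t).
    """
    hf = sigmoid(p["wf"] @ x + p["uf"] @ z_prev + p["bf"])   # forget gate
    hi = sigmoid(p["wi"] @ x + p["ui"] @ z_prev + p["bi"])   # input gate
    ho = sigmoid(p["wo"] @ x + p["uo"] @ z_prev + p["bo"])   # output gate
    hz = np.tanh(p["wz"] @ x + p["bz"])                      # cell candidate
    z = hf * z_prev + hi * hz                                # state update rule
    y = ho * np.tanh(z)                                      # LSTM output
    return y, z
```

In the serial hybrid model, y(t) from the last hidden layer feeds the Rate(7) output layer, whose outputs enter the discrete material balance ODE(27).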

| Deep learning method
The raw data acquired in the Ambr® system was preprocessed to a suitable format for training the hybrid models (details in Supporting Information S1). For calibration of the training method, one representative FFNN and one representative LSTM hybrid structure (both ending in Rate(7)-ODE(27)) were adopted. The optimal weights dropout probability was searched between 0 and 0.45. The number of iterations was fixed to a sufficiently large number (4 × 10^4) to ensure convergence. The overall results are summarized in Figure 4. The dropout probability had a more pronounced effect on the average weighted mean squared error (WMSE) of the LSTM structure.
The lowest test WMSE for the LSTM network was obtained with a dropout probability of 0.1, while for the FFNN the same results were obtained for dropout probabilities between 0 and 0.1. For this reason, a fixed weights dropout probability of 0.1 was adopted in every test performed, irrespective of structure. The combination of a ReLU as the first hidden layer followed by one LSTM hidden layer yielded the overall best predictive power (In(27)-ReLU(16)-LSTM(7)-Rate(7)-ODE(27)). Two stacked LSTM layers achieved a similar result, although at the cost of higher complexity (Table 1). Stacking LSTM layers gives the model the ability to fit complex high-order functions (Jiang et al., 2021). In this case, however, stacking LSTM layers and introducing a single LSTM layer led to similar performances. In conclusion, stacking LSTM layers was not advantageous for this problem, as the complexity (number of weights) increased without a noticeable improvement in predictive power. Overall, the average training and testing WMSE for the best LSTM hybrid model are around 35% and 13% lower, respectively, than for the best FFNN hybrid model (Figure 5 and Table 1).
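The stochastic regularization used here zeroes a random fraction of the network weights at each training iteration. A minimal sketch of such a weight-dropout step is shown below (illustrative only; the exact scheme is the one described in Pinto et al., 2022, and the inverted 1/(1-p) rescaling is an assumption we add to keep the expected weight magnitude unchanged):

```python
import numpy as np

def dropout_weights(W, p_drop, rng):
    """Randomly zero a fraction p_drop of the weights in W for one
    training iteration (DropConnect-style weight dropout, sketch).
    Surviving weights are rescaled by 1/(1 - p_drop)."""
    mask = rng.random(W.shape) >= p_drop     # keep each weight with prob. 1 - p_drop
    return W * mask / (1.0 - p_drop)
```

During the dropout-probability search described above, p_drop would be swept over a grid (here between 0 and 0.45) and the value minimizing the validation WMSE retained.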

| Searching for the optimal hybrid model configuration
Despite the higher number of weights of the LSTM model compared to the FFNN (1071 and 431, respectively), the average AICc of the LSTM model is only 3% higher. Notably, the LSTM hybrid model seems to be less affected by the test/validation/train data resampling, as it leads to narrower ranges for the training and testing WMSE (Figure 5 and Table 1).

| Analysis of the best LSTM hybrid model
The hybrid LSTM model (In(27)-ReLU(16)-LSTM(7)-Rate(7)-ODE(27)), which achieved the lowest average and dispersion values of the training and testing WMSE, was analyzed in more detail. Figure 7 shows the predicted over measured concentrations for each of the 27 species individually and for a particular train:validation:test partition, namely the one with the lowest train+test WMSE (Figure 6a). With few exceptions, the coefficient of determination (R2) of predicted over measured concentrations is higher than 0.85.
The only exceptions were Glyc, Pro and Pyr. In the case of Pro, the R2 for the testing subset was only 0.61. Pro is one of the metabolites with the lowest variation during the cultivation (around 33% difference between initial and final concentrations). Despite the low R2, the dynamic prediction of Pro is in general within the experimental error bounds (see below). Of note, the R2 for the viable cell concentration, which is the main product of this culture, had high values of 0.97 and 0.91 for the training and testing data subsets, respectively. As a rule of thumb, the R2 for the training data subset tends to be higher than for the testing subset due to some degree of overfitting. However, several exceptions to this scenario were observed, which included Ala, Asp, Glu, Gln, Gly, His, Ile, Lac, Leu, Lys, Met, NH4 and Thr. These results obviously depend on the data resampling, but they suggest a successful training with mitigated overfitting. Figure 8 further details the dynamic profiles. The accurate prediction of these profiles is noteworthy, given that fed-batch dynamics are notoriously difficult to predict using mechanistic modeling approaches. Mammalian cell growth kinetics frequently require more complex models accounting for inhibitory effects of by-products such as Lac and NH4 (Pörtner & Schäfer, 1996) and possibly several other metabolites that increase the cell death rate (Chong et al., 2011). Fed-batch experiments reach higher cell densities compared to batch, and consequently higher concentrations of toxic by-products are obtained. As an example, Lac, NH4 and For varied between 2.62 and 38.67 mM, 0.70 and 7.21 mM, and 0.15 and 2.33 mM, respectively, in the performed HEK293 experiments.
The hybrid LSTM model efficiently captured such kinetic effects given the accurate prediction of the viable cell dynamics, particularly the transition between the cell growth and decay phases.
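The per-species accuracy reported above is the coefficient of determination of predicted over measured concentrations, which can be computed with a small helper (illustrative sketch; the function name is ours):

```python
import numpy as np

def r_squared(measured, predicted):
    """Coefficient of determination (R2) of predicted over measured values,
    as used to score each of the 27 species individually."""
    ss_res = np.sum((measured - predicted) ** 2)          # residual sum of squares
    ss_tot = np.sum((measured - np.mean(measured)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot
```

Applied per species and per data subset (train, validation, test), this yields the values discussed above, e.g. 0.97 and 0.91 for the viable cell concentration.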
3.5 | Is the LSTM architecture advantageous in a hybrid modeling scheme?
Hybrid modeling of mammalian systems, mostly of CHO cultures, has mainly explored the combination of shallow FFNNs and material balance equations in the form of ODEs (Agharafeie et al., 2023). The inclusion of reliable mechanistic equations in the hybrid model generally reduces the data dependency and improves the predictive power (e.g., Bayer et al., 2021; Bayer et al., 2022; Narayanan et al., 2019; Nold et al., 2023; Senger & Karim, 2003). The results obtained here corroborate the findings of previous studies (Pinto et al., 2022; Pinto et al., 2023a; 2023b; 2023c) showing that deep hybrid modeling outperforms shallow hybrid modeling.
As for the difference between deep FFNN and deep LSTM hybrid models, the results are shown in Table 1 and Figures 5 and 6. The LSTM structure is known to capture complex dynamics (short- and long-term memory effects). This capability is inherent to the LSTM: its three gates (input, output, forget) allow it to selectively remember or forget information, and thereby to capture and retain relevant input data features. In a biological context, the advantage of LSTM hybrid models is likely to be more noticeable in problems with strong population and/or intracellular dynamics. The FFNN hybrid model includes dynamic ODEs of extracellular variables but completely disregards population and intracellular dynamics. The hybrid LSTM model structure may be particularly effective when cells change their size and composition significantly depending on cultivation conditions. Typically, average cell sizes are larger during the exponential growth phase than during the transition and cell death phases, and these differences change with cell line and growth condition (Nielsen et al., 1997). Previous studies on CHO cells have shown that single cell volume, dry weight and internal composition may change significantly from the start to the end of a cultivation (Széliová et al., 2020). When the cell properties and composition change significantly over a cultivation time window, a complex memory effect may be created. In such cases, the LSTM network offers a structural advantage for capturing such complex dynamics over the FFNN structure, which treats the population of cells and the intracellular phase as a static process.

Figure 3a shows that a single principal component (PC) captures over 92% of the data variance, whereas 7 PCs cumulatively explain over 99% of the data variance. These results evidence strong correlations between the consumption and/or production of the 27 measured compounds. Figure 4c,d shows the training, validation and testing WMSE over the training cycle using the selected parameters (average values among 10 repetitions with different permutations of train/validation/test experiments). These results show that 4 × 10^4 iterations are enough for training, because the lowest average validation WMSE was always achieved early (<1 × 10^4 iterations) in the training cycle. Despite the very high CPU time cost, 4 × 10^4 iterations were used for the remaining studies to ensure that the cross-validation minimum is always spotted for every FFNN or LSTM hybrid structure investigated, irrespective of complexity. This is essential to ensure a fair comparison between both approaches. In future studies, the CPU time to train LSTM hybrid structures may be reduced by employing stochastic regularization based on mini-batch size and weights dropout in replacement of cross-validation (Pinto et al., 2022).

A total of 784 hybrid configurations based on FFNNs or LSTMs were systematically compared. The number of hidden layers, the number of nodes in the hidden layers and different types of activation functions were evaluated.

FIGURE 3 PCA of reacted amounts of 20 reactor experiments (6480 time points) and 27 extracellular rates (process descriptors). Data was normalized column-wise by dividing by the maximum absolute value of the reacted amount. (a) Left axis: explained variance by each PC; right axis: cumulative explained variance over PC number. (b) Biplot of PC-2 (2.5% explained variance) over PC-1 (92.6% explained variance). Red dots represent score values. Blue vectors represent the coefficients of process descriptors.

In all cases, the first layer was the input layer, In(27), with 27 inputs of normalized concentrations. It was followed by one or more feedforward hidden layers with ny outputs (either ReLU(ny), Tanh(ny) or Lin(ny), depending on the activation function) or a peephole LSTM layer (LSTM(ny)). The last layer always had seven outputs corresponding to the specific reaction rates (Rate(7)). The specific rates were then passed to the material balance equations (system of ODEs) to calculate the 27 concentrations (ODE(27)). LSTM-based hybrid structures include at least one LSTM layer, whereas FFNN hybrid structures included only feedforward hidden layers. The "smallest" model with acceptable results was a shallow hybrid FFNN structure with four hidden nodes. Additional layers were added with sizes 4-27, up to four hidden layers. The number of weights varied between 147 and 2464. In the case of hybrid LSTM structures, LSTM sizes varied between 4 and 27, and up to four stacked LSTMs were investigated. The number of weights varied between 275 and 2977. In all cases, structures with more weights than training data points were disregarded. Some FFNNs had a complexity (number of weights) comparable to the LSTM-based structures, but in general LSTM-based structures tend to have a higher number of weights than FFNNs due to their complex internal structure composed of gate layers. The 784 hybrid structures were trained with the same pre-calibrated training method described in the previous section. The partial results of the top 10 best-performing LSTM and FFNN hybrid models are shown in Table 1. The complete set of results for the 784 structures is provided in the supplementary material (Supporting Information S2). Overall, LSTM hybrid models consistently outperformed the FFNN hybrid models in terms of average train and test WMSE over the 10 repetitions. Since FFNNs have comparatively fewer weights, their AICc tends to approach that of the LSTM hybrid structures. Thus, the AICc discrimination criterion, which balances the number of parameters and the goodness of the model fit (error), does not point to the same conclusions as the minimum test WMSE. In the case of hybrid FFNN structures, the ReLU activation function generally outperformed Tanh. Nevertheless, the FFNN hybrid structure with the best predictive power comprised four Tanh hidden layers (In(27)-Tanh(8)-Tanh(8)-Tanh(8)-Tanh(7)-Rate(7)-ODE(27)).
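The AICc criterion used for structure discrimination trades off goodness of fit against the number of weights. A sketch in one common least-squares form is given below; the paper does not state its exact formula, so this is an assumed variant (AICc = n·ln(MSE) + 2k + 2k(k+1)/(n-k-1)) for illustration only.

```python
import math

def aicc(mse, n_points, n_weights):
    """Corrected Akaike information criterion for a least-squares model
    (one common form; the paper's exact expression may differ).
    mse       : mean squared error of the fit (here, the WMSE)
    n_points  : number of training data points n
    n_weights : number of model parameters k
    """
    n, k = n_points, n_weights
    return n * math.log(mse) + 2 * k + 2 * k * (k + 1) / (n - k - 1)
```

At equal fit error, a model with more weights (e.g., the 1071-weight LSTM versus the 431-weight FFNN) is penalized with a higher AICc, which explains why the two structures come out closer under AICc than under the raw test WMSE.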

Figure 6 further details the training results for the best LSTM and FFNN deep hybrid models. The results of the 10 permutations of train/validation/test are shown along with the predicted over measured concentrations for the best permutation (lowest training WMSE, which is also the lowest overall WMSE for train+test). Overall, the training was rather successful for both model structures, as denoted by the linear correlation coefficients (R2 = 0.972 and 0.968). The LSTM has a slightly higher R2 for the training and test data sets compared to the FFNN. The results of the 10 permutations also show that the LSTM hybrid model seems to be less affected by the test:validation:train data resampling, as evidenced by the coefficient of variation (CV) (Figure 6). The CV of the LSTM hybrid structure was 24.02 and 10.51 for the train and test WMSE, respectively.

FIGURE 5 Boxplot of the training and testing WMSE and AICc for the best LSTM (In(27)-ReLU(16)-LSTM(7)-Rate(7)-ODE(27)) and best FFNN (In(27)-Tanh(8)-Tanh(8)-Tanh(8)-Tanh(7)-Rate(7)-ODE(27)) hybrid models. Results from 10 train/test partitions obtained by experiment-wise random data resampling. The bar represents the median, the box the first and third quartiles, and the whiskers the minimum and maximum.

Figure 8 shows experimental concentrations and corresponding hybrid model predictions for a training and a testing fed-batch experiment (the remaining model predictions for the test cultivations are provided in the supplementary material, Supporting Information S3). The hybrid model correctly predicted the concentration dynamics of the 27 extracellular species, including biomass, key substrates, amino acids and metabolic by-products. The predicted dynamic profiles were always within the error bounds for both the training and testing fed-batch experiments.
Table 1 and Figures 5 and 6 reveal that the best LSTM hybrid model improves the average training WMSE by 35% relative to the best FFNN hybrid model. As for the difference in predictive power (average testing WMSE), the gap is narrower but still significant, with a 13% improvement. The training and testing WMSE dispersion (given by the CV) is in general much lower for the LSTM hybrid model than for the FFNN model (23% and 66% improvement, respectively). This implies that LSTM hybrid models are in general more robust to data resampling. Despite these advantages, LSTM hybrid models generally require a larger number of parameters due to their inherent complexity, and their training is much slower than for FFNN hybrid models with the same number of weights (around 50% higher CPU time for LSTM hybrid models). These observations hold for the 784 evaluated structures, with the results shown in the supplementary material (Supporting Information S2).
A deep hybrid modeling framework merging DNNs and first principles equations was investigated and showcased with a HEK293 fed-batch process. Hybrid structures based on multilayered FFNNs and LSTMs in many different configurations (standalone, combined, stacked, with varying depths and layer sizes) were systematically compared. For the HEK293 process studied, the LSTM hybrid models outperformed the FFNN hybrid models in most of the scenarios tested in terms of training and testing error. The best LSTM hybrid structure (In(27)-ReLU(16)-LSTM(7)-Rate(7)-ODE(27)) showed a high predictive power for the dynamics of the 27 measured extracellular compounds. LSTMs may learn complex dynamics that mimic the biochemical "memory" of cell populations with varying intracellular composition exposed to a transient environment for long periods of time. Nevertheless, the inclusion of ODEs of extracellular compounds also confers on FFNN hybrid models the capacity to effectively describe extracellular dynamics. Given the results obtained, both FFNN and LSTM networks are good candidates for hybrid modeling. The adoption of the more complex LSTM structure may be justified in cases where complex population dynamics, with strong variations in cell size, gene copy number and intracellular composition, introduce a significant dynamic variability to the process.