A real‐time flood forecasting hybrid machine learning hydrological model for Krong H'nang hydropower reservoir

Flood forecasting is critical for mitigating flood damage and ensuring the safe operation of hydroelectric power plants and reservoirs. This paper presents a new hybrid hydrological model that combines the Hydrologic Engineering Center-Hydrologic Modeling System (HEC-HMS) hydrological model with an Encoder-Decoder-Long Short-Term Memory network to enhance the accuracy of real-time flood forecasting. The proposed hybrid model has been applied to the Krong H'nang hydropower reservoir. Observed data from 33 floods monitored between 2016 and 2021 are used to calibrate, validate, and test the hybrid model. Results show that the HEC-HMS-artificial neural network hybrid model significantly improves the forecast quality, especially at longer lead times. In detail, the improvement in the Kling–Gupta efficiency (KGE) index, for example, rises from ∆KGE = 16% at time t + 1 h to ∆KGE = 69% at time t + 6 h. Similar results were obtained for other indicators, including peak error and volume error. The computer program developed for this study is being used in practice at the Krong H'nang hydropower plant to aid in reservoir planning, flood control, and water resource efficiency.


| INTRODUCTION
In recent years, extreme rainfall and flood events have occurred with increasing frequency and intensity, exceeding historical values. Some studies have shown that, under the impact of climate change, extreme rainfall events may occur more frequently, which can increase the severity of flooding events (Cullmann et al., 2020; NOAA, 2021). Flood forecasting and warning information is thus critical to relevant authorities and departments in planning to prevent, respond to, and mitigate the harmful effects of floods, as well as to evacuate people from flood-prone areas. Furthermore, flood forecasting results can aid state management agencies, such as irrigation and hydropower plant operators, in controlling floods to ensure project safety and downstream flood safety (WMO, 2011, 2017).
There are different approaches, ranging from simple to complex structures, for modeling the rainfall-runoff relationship. These methods can be divided into two groups: physically based models and data-driven models. The choice of method depends on the amount of data available in each case study (Beven, 2010). Physically based models that have been widely studied and applied in practice, such as MIKE-NAM (by DHI, Denmark), the Soil & Water Assessment Tool (SWAT), and the Hydrologic Engineering Center-Hydrologic Modeling System (HEC-HMS), have shown the feasibility of simulating and forecasting rainfall-runoff scenarios (Albo-Salih et al., 2022; Chu & Steinman, 2009; Martin et al., 2012). However, these models often require a variety of data types (e.g., topographic, geomorphological, meteorological, and hydrological data) and in-depth modeling expertise to establish relevant databases on the rainfall-runoff characteristics of each study basin. It also takes a significant amount of time to calibrate and validate the model parameters, which makes it difficult to respond quickly to real-time flood events (Che & Mays, 2015; Şensoy et al., 2018).
Meanwhile, the goal of data-driven models is to establish the correlations between precipitation and discharge from observed data without taking the relevant physical processes into account. Common models in this group include linear regression models, nonlinear regression models, autoregressive integrated moving average models, artificial neural networks (ANN), adaptive neuro-fuzzy inference systems, and support vector machines. Among them, the ANN model has been widely applied to many different hydrological problems. A number of previous studies have demonstrated that ANN models outperform other classical techniques in their ability to simulate complex nonlinear relationships between input and output variables. ANN models can be trained to capture the nonlinear rainfall-runoff relationship and predict runoff quantitatively without prior knowledge of basin characteristics (ASCE, 2000a; Kraft et al., 2020). The limitation of this approach is that the user cannot access the internal logic (i.e., it is a black-box model), and it does not represent the physical characteristics of rainfall-runoff, topographic changes, and surface buffer factors in the catchment (ASCE, 2000b).
The hybrid hydrological model was created to combine the advantages of both types of models. Many authors have investigated the coupling of data-driven models with physically based models such as SWMM (Kim & Han, 2020), SWAT (Yuan & Forshay, 2021), and HEC-RAS (Tamiru & Wagari, 2021) to forecast floods and urban flooding, and have shown the superiority of the hybrid model. However, because the application of ANN to hydrological problems in Vietnam is still in its early stages, a tailored study is required to demonstrate both the effectiveness and the limitations that may be encountered when using it for flood forecasting decision support in reservoir operation.
This study presents the development of a hybrid machine learning hydrological model, which combines the HEC-HMS and ANN models, for the Krong H'nang hydropower reservoir in Dak Lak province. First, the physically based HEC-HMS hydrological model is constructed to simulate the rainfall-runoff process in the basin. The outputs from the HEC-HMS model then become inputs for the ANN model, which uses the Encoder-Decoder-Long Short-Term Memory (LSTM) architecture (Gauch et al., 2021; Kraft et al., 2020). Data from 33 floods from 2016 to 2021 are divided into three data sets for training, validation, and testing with a ratio of 70:15:15. The results of the hybrid model are evaluated in detail, considering the degree of improvement in flood forecasting quality at different lead times compared to the pure HEC-HMS model.

| STUDY AREA AND DATA USED
Krong H'nang is one of the major hydroelectric projects located in the Ba River basin, which is among the largest river catchments in the South-Central region of Vietnam. The main earth dam, with a length of approximately 1.1 km, was built on the Ea Krong H'nang River and is located on National Highway 29, about 80 km northeast of Buon Ma Thuot, Dak Lak province.
The basin area is about 1168 km², with the terrain divided by the Truong Son mountain range and the plateau, forming two distinct topographical regions (Figure 1). The upstream area, with a topographic elevation above 500 m, is a basalt red soil plateau planted with industrial crops and influenced by the southwest monsoon during the rainy season from May to November. The downstream area is a plain of rice and cash crops with elevations from 240 to 500 m, interspersed with mountain ranges with peaks of up to 1200 m, forming valleys with steep slopes of 8%-15%. Its rainy season lasts from September to December and is strongly influenced by the northeast monsoon, tropical depressions, and storms in the East Sea (EVN-PECC4, 2007).
The observed and HEC-HMS simulated data for the period from September 2016 to December 2021 are used as input data for the hybrid model. These include rainfall data from 15 gauges and measured inflow data. The HEC-HMS simulated results of the 33 floods (15-min intervals) are presented in detail by Nguyen et al. (2022). Evaporation loss from the reservoir is ignored. The entire data set is divided into subsets with a ratio of 70% for model training, 15% for validation, and 15% for testing, as shown in Figure 2.
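The 70:15:15 division can be illustrated with a short sketch. It assumes that the split is chronological and made on whole flood events, which is consistent with events #24-#33 being used for validation and testing but is not stated explicitly in the text; the function name `split_events` is hypothetical.

```python
def split_events(event_ids, train_frac=0.70, valid_frac=0.15):
    """Split an ordered list of event IDs into train/valid/test subsets.

    Assumption: events are kept whole and in chronological order.
    """
    n = len(event_ids)
    n_train = round(n * train_frac)
    n_valid = round(n * valid_frac)
    train = event_ids[:n_train]
    valid = event_ids[n_train:n_train + n_valid]
    test = event_ids[n_train + n_valid:]  # whatever remains
    return train, valid, test


events = list(range(1, 34))  # flood events #1 to #33
train, valid, test = split_events(events)
print(len(train), len(valid), len(test))
```

Under these assumptions, 23 events fall in the training set and the remaining 10 events are shared between validation and testing.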
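The MinMaxScaler function (Equation 1) normalizes the input data to the range [0, 1] to improve the ANN model's training speed and learning efficiency:

$$x_{\mathrm{scaler}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$

where $x$ is the raw value, $x_{\min}$ and $x_{\max}$ are the minimum and maximum of the data series, and $x_{\mathrm{scaler}}$ is the value after normalization.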
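As a minimal sketch of min-max normalization to [0, 1], the following pure-Python functions mirror what a MinMaxScaler does for a single series. The function names are hypothetical, and taking the bounds from the training data only is a common practice assumed here rather than a detail given in the text.

```python
def fit_minmax(train_values):
    """Return the (min, max) bounds learned from the training series."""
    return min(train_values), max(train_values)


def minmax_scale(values, lo, hi):
    """Map values to [0, 1] using previously fitted bounds."""
    if hi == lo:
        # Constant series: avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def minmax_inverse(scaled, lo, hi):
    """Map normalized values back to physical units (e.g., m3/s)."""
    return [s * (hi - lo) + lo for s in scaled]


q_train = [10.0, 20.0, 30.0]  # toy discharge values
lo, hi = fit_minmax(q_train)
q_scaled = minmax_scale(q_train, lo, hi)
q_back = minmax_inverse(q_scaled, lo, hi)
```

The inverse transform is needed to convert the network output back to discharge units before computing error metrics.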

| Model structure and descriptions
The HEC-HMS-ANN hybrid model structure is illustrated in Figure 3. The input layer consists of rainfall data from 15 stations ($X_1, X_2, \ldots, X_{15}$), measured discharge ($Q_{\mathrm{obs}}$) from the last 8 h ($t - 8$ h) to the present ($t = 0$), and forecasted discharge from the HEC-HMS model ($Q_{\mathrm{hms}}$) up to $t + 6$ h ahead. The hidden part consists of $k$ layers with $n$ neurons in each layer. The output layer is the prediction of the reservoir inflow, $Q_{\mathrm{pre}}$.
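To make the input and output windows concrete, the sketch below builds training samples from three aligned series. It is illustrative only: it assumes hourly steps, a single rainfall value per step (the model described here uses 15 gauges), and that the HEC-HMS forecast enters only for the future steps; `make_samples` is a hypothetical helper.

```python
def make_samples(rain, q_obs, q_hms, n_past=8, n_ahead=6):
    """Build (past inputs, future HEC-HMS forecast, target) triples.

    All three series are assumed to share one hourly time axis.
    """
    samples = []
    for t in range(n_past, len(q_obs) - n_ahead):
        past = slice(t - n_past, t + 1)          # t-8 ... t (9 steps)
        future = slice(t + 1, t + n_ahead + 1)   # t+1 ... t+6 (6 steps)
        encoder_in = list(zip(rain[past], q_obs[past]))
        decoder_in = q_hms[future]
        target = q_obs[future]
        samples.append((encoder_in, decoder_in, target))
    return samples


# Toy series: 20 hourly steps.
hours = 20
rain = [0.0] * hours
q_obs = [float(i) for i in range(hours)]
q_hms = [q + 0.5 for q in q_obs]
samples = make_samples(rain, q_obs, q_hms)
```

Each sample pairs nine past steps of observations with six future steps of physically based forecast, and the target is the observed inflow over the same six future steps.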
The length of the input data series for the ANN model has a large impact on the model results. However, there is currently no specific theoretical formula for choosing it. Thus, the choice can be guided by a correlation analysis between the flood discharge and its own lagged values, or the lagged values of another time series (such as precipitation), through the cross-correlation function (CCF) defined by Equation (2):

$$\mathrm{CCF}_{xy}(k) = \frac{\mathrm{CVF}_{xy}(k)}{\sigma_x \sigma_y} \tag{2}$$

where $k$ is the number of lag time steps; $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$; and $\mathrm{CVF}_{xy}(k)$ is the cross-covariance function of $x$ and $y$, determined by the expression

$$\mathrm{CVF}_{xy}(k) = \frac{1}{N_t} \sum_{t=1}^{N_t - k} (x_t - \bar{x})(y_{t+k} - \bar{y}) \tag{3}$$

where $t$ is the time step; $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$; and $N_t$ is the number of data points in the time series. Based on the cross-correlation analysis of the rainfall and runoff data series at lagged time steps shown in Figure 4, the input sequence that correlates strongly with the target series (forecast flow) was selected to span $t - 8$ h to $t + 6$ h (see Table 1).
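The cross-correlation analysis can be sketched in a few lines of Python. The sketch assumes the common convention in which the cross-covariance at lag k is normalized by the full series length and by the population standard deviations; the function name and the toy series are illustrative only.

```python
import math


def cross_correlation(x, y, k):
    """Cross-correlation between x(t) and y(t + k) for a lag k >= 0."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    std_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    std_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    # Cross-covariance over the overlapping part of the two series.
    cov = sum((x[t] - mean_x) * (y[t + k] - mean_y) for t in range(n - k)) / n
    return cov / (std_x * std_y)


# Toy example: a rainfall pulse followed by a flow pulse two steps later.
rain = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
flow = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
best_lag = max(range(5), key=lambda k: cross_correlation(rain, flow, k))
```

In this toy case the correlation peaks at the lag that matches the delay between the rainfall pulse and the flow response, which is the kind of evidence used to choose the input window length.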

| The hydrological HEC-HMS model
HEC-HMS was developed by the Hydrologic Engineering Center of the US Army Corps of Engineers (USACE) (USACE, 2021). Figure 5 illustrates how HEC-HMS simulates the hydrological processes of a watershed, from precipitation to watershed discharge, through the different physical processes of the water cycle: precipitation, evaporation, infiltration, overland flow, base flow, and stream channel flow.

| Encoder-Decoder LSTM network
The LSTM is a special neural network architecture, designed by Hochreiter and Schmidhuber (1997) to overcome the weakness of traditional recurrent neural networks through its ability to learn long-term dependencies. For the rainfall-runoff problem, this is extremely meaningful because rainfall-runoff data from far in the past can be used to forecast the flow in accordance with the characteristics of water storage and the natural lag time in the basin.

FIGURE 6 The Long Short-Term Memory network's repeating module, which contains four hidden layers (three sigmoid and one tanh) that interact with one another (Kratzert et al., 2018).
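The structure of the LSTM consists of four hidden layers (three sigmoid and one tanh) that interact in a special way (see Figure 6). The idea behind the LSTM is that the cell state, which acts as a kind of straight-line conveyor belt, carries information stably across the entire network through simple mathematical operations (element-wise multiplication and addition). The transformation, transmission, and storage of information in the LSTM network proceed through the steps described in Equations (4) to (9), written here in the standard LSTM form:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates, respectively; $W_i$, $W_f$, and $W_o$ represent the weights connecting the input, forget, and output gates with the input $x_t$, respectively; $U_i$, $U_f$, and $U_o$ denote the weights from the input, forget, and output gates to the hidden state $h_{t-1}$, respectively; $W_c$ and $U_c$ are the corresponding weights for the candidate cell state $\tilde{c}_t$; $b_i$, $b_f$, $b_o$, and $b_c$ are bias terms; $c_t$ is the cell state; $\odot$ denotes element-wise multiplication; $\sigma$ is the logistic sigmoid function; and tanh is the hyperbolic tangent function.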
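A single LSTM time step can be sketched in plain Python to show how the three gates and the cell state interact. This follows the standard LSTM formulation with bias terms, which is assumed here; the helper names (`affine`, `lstm_step`) and the parameter layout are hypothetical and not taken from the study's program.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_j
        for W_row, U_row, b_j in zip(W, U, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps 'i', 'f', 'o', 'c' to (W, U, b)."""
    def layer(name, activation):
        W, U, b = params[name]
        return [activation(z) for z in affine(W, x, U, h_prev, b)]

    i = layer("i", sigmoid)          # input gate
    f = layer("f", sigmoid)          # forget gate
    o = layer("o", sigmoid)          # output gate
    c_tilde = layer("c", math.tanh)  # candidate cell state
    # Cell state: keep part of the old state, add part of the candidate.
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, c_tilde)]
    # Hidden state: gated, squashed view of the cell state.
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


# Toy check: one input feature, one hidden unit, all weights zero.
zero = ([[0.0]], [[0.0]], [0.0])
params = {"i": zero, "f": zero, "o": zero, "c": zero}
h, c = lstm_step([1.0], [0.0], [2.0], params)
```

With all weights at zero every gate equals 0.5, so the new cell state is simply half of the previous one, which makes the role of the forget gate easy to see.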
An encoder and a decoder, each built from LSTM layers, form the Encoder-Decoder LSTM architecture (Figure 7), which is well suited to sequence problems in which the input and output sequences differ in length and involve multiple input and output variables.

| Model performance assessment
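Several commonly used numerical indicators are selected to evaluate the model's performance in capturing the real hydrologic system. These include the percent error in peak discharge, the percent error in discharge volume, and the Kling-Gupta efficiency (KGE, dimensionless). They are defined as follows:

• Relative peak error (%):

$$\mathrm{Peak\ error} = \frac{Q^{p}_{\mathrm{sim}} - Q^{p}_{\mathrm{obs}}}{Q^{p}_{\mathrm{obs}}} \times 100$$

• Relative volume error (%):

$$\mathrm{Vol\ error} = \frac{V_{\mathrm{sim}} - V_{\mathrm{obs}}}{V_{\mathrm{obs}}} \times 100$$

• KGE:

$$\mathrm{KGE} = 1 - \sqrt{(r - 1)^2 + \left(\frac{\sigma_{\mathrm{sim}}}{\sigma_{\mathrm{obs}}} - 1\right)^2 + \left(\frac{\bar{Q}_{\mathrm{sim}}}{\bar{Q}_{\mathrm{obs}}} - 1\right)^2}$$

where $n$ is the total number of data points, $Q$ (m³/s) denotes discharge, $Q^{p}$ is the peak discharge, $V$ (m³) is the volume at control points, $r$ is the linear correlation coefficient between the simulated and observed series over the $n$ data points, $\sigma$ is the standard deviation, and $\bar{Q}$ is the mean of the time-series data. The subscripts sim and obs denote the simulated and observed time series data, respectively.

The relative improvement $\Delta M$ (%) of the results from the hybrid model compared to those using only the HEC-HMS model is computed for the different indices by Equation (16), where $\Delta M$ is the degree of improvement in the metric $M$ (KGE, volume error, peak error); $M_1$ and $M_2$ are the evaluation indicators of the HEC-HMS model and the hybrid model, respectively; and $M_i$ is the ideal value of the indicators: KGE = 1, volume error = 0, and peak error = 0.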
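The three indicators can be computed with a few lines of Python. This is a sketch under stated assumptions: errors are taken as simulated minus observed (so a negative peak error means underestimation, as in the discussion of the results), KGE uses its commonly cited 2009 form, and the volume is approximated by summing discharge over a fixed 15-min step; the function names are hypothetical.

```python
import math


def peak_error(q_sim, q_obs):
    """Relative peak error in percent (negative = underestimated peak)."""
    return (max(q_sim) - max(q_obs)) / max(q_obs) * 100.0


def volume_error(q_sim, q_obs, dt_seconds=900.0):
    """Relative volume error in percent, with volume = sum(Q) * dt."""
    v_sim = sum(q_sim) * dt_seconds
    v_obs = sum(q_obs) * dt_seconds
    return (v_sim - v_obs) / v_obs * 100.0


def kge(q_sim, q_obs):
    """Kling-Gupta efficiency (1 is a perfect match)."""
    n = len(q_obs)
    mean_s, mean_o = sum(q_sim) / n, sum(q_obs) / n
    std_s = math.sqrt(sum((q - mean_s) ** 2 for q in q_sim) / n)
    std_o = math.sqrt(sum((q - mean_o) ** 2 for q in q_obs) / n)
    cov = sum((s - mean_s) * (o - mean_o) for s, o in zip(q_sim, q_obs)) / n
    r = cov / (std_s * std_o)
    return 1.0 - math.sqrt(
        (r - 1.0) ** 2 + (std_s / std_o - 1.0) ** 2 + (mean_s / mean_o - 1.0) ** 2
    )


q_obs = [10.0, 20.0, 40.0, 30.0, 15.0]   # toy hydrograph, m3/s
q_sim = [1.1 * q for q in q_obs]         # uniformly 10% too high
scores = (peak_error(q_sim, q_obs), volume_error(q_sim, q_obs), kge(q_sim, q_obs))
```

Because the time step cancels in the relative volume error, the choice of `dt_seconds` only matters if absolute volumes are also reported.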

| The HEC-HMS model
The Krong H'nang basin, with an area of 1168 km², is divided into six subbasins (from Sub1 to Sub6). The main river from the outlet of subbasin Sub1 through to the Krong H'nang reservoir is divided into four river sections (from Reach1 to Reach4) based on the topographic and hydrological characteristics of the basin. The subbasins and reaches are connected through junctions. The Krong H'nang catchment as set up in the HEC-HMS model is shown in Figure 8. The subbasin areas and the reach lengths are measurable physical parameters, as presented in Table 2. There are 15 automatic rain gauges in the Krong H'nang basin. The rainfall data from the stations are converted to the basin-average rainfall of the subbasins according to the Thiessen polygon method. The optimal model parameters are identified automatically by the Shuffled Complex Evolution-University of Arizona (SCE-UA) method (Duan et al., 1992). The results of running the HEC-HMS model for the 33 floods at a 15-min time step ($Q_{\mathrm{hms}}$) are stored in the database for convenience in training the ANN model (Nguyen et al., 2022).
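The Thiessen polygon method reduces to an area-weighted average once the polygon areas are known. The sketch below assumes the areas of each gauge's polygon inside a subbasin have already been derived (for example in a GIS); the gauge labels and numbers are made up for illustration.

```python
def thiessen_average(gauge_rain, gauge_area):
    """Area-weighted mean rainfall for one subbasin.

    gauge_rain: rainfall depth per gauge (mm) for one time step.
    gauge_area: area of each gauge's Thiessen polygon inside the subbasin (km2).
    """
    total_area = sum(gauge_area.values())
    weighted = sum(gauge_rain[g] * area for g, area in gauge_area.items())
    return weighted / total_area


# Hypothetical subbasin covered by two gauges.
rain_mm = {"T1": 10.0, "T2": 20.0}
area_km2 = {"T1": 30.0, "T2": 10.0}
p_ave = thiessen_average(rain_mm, area_km2)
```

Gauges whose polygons do not intersect a subbasin simply do not appear in that subbasin's area dictionary, so they receive zero weight.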
Parameters of the HEC-HMS model have been calibrated and validated in the paper by Nguyen et al. (2022), with values summarized in Table 3 below.

| The hybrid HEC-HMS-ANN model
Among the configurations considered for the number of hidden layers and the number of neurons in each hidden layer of the Encoder-Decoder LSTM network, the model named 20-20-20-1, which consists of 20 neurons in the input layer, two hidden layers with 20 neurons each, and one output layer with one neuron, was selected as the working model. The mean squared error (MSE) loss (on normalized data) over the training process is shown in Figure 9.
Figure 10 shows the validation results of the hybrid model for flood event #24 at forecast time steps t + 1 to t + 6 h. The improvement in forecast quality at lead times t + 1 and t + 2 h is minor because the HEC-HMS model, calibrated with the SCE-UA optimization algorithm for the Krong H'nang reservoir, already performs very well at these lead times. However, at longer lead times such as t + 5 and t + 6 h, the HEC-HMS model cannot capture the flood pattern and heavily underestimates the flood peak and volume. The hybrid model, on the other hand, yields superior results compared to using the HEC-HMS model alone. It is able to forecast the flood pattern, as well as the flood peak and volume, reasonably well.
The model results for all flood events used in the validation and testing phases (events #24-#33) are presented in Figure 11 for all three assessment criteria in the form of boxplots. For each box, the line inside represents the median value (50%), while the box height is bounded by the 25% and 75% values. For the KGE index, a better model gives a smaller box height and a box closer to KGE = 1. For the peak and volume errors, a better model gives values closer to 0%. It can be seen from Figure 11 that for lead times of t + 1 to t + 2 h, the HEC-HMS model performs very well, and the hybrid model adds little value and may even be unnecessary at these forecast time steps. However, starting from a lead time of t + 3 h, the accuracy of the HEC-HMS forecasts decreases quickly, and at longer lead times such as t + 5 and t + 6 h, these forecasts are not reliable enough to use. In contrast, the hybrid model with the integrated ANN structure significantly improves the forecast accuracy for all indices. For the KGE index, on average, the results are improved by 55% and 69% at lead times t + 5 and t + 6 h, respectively. For the error in total flood volume, a crucial index in flood forecasting for reservoirs, the results are improved by 26% and 33%, respectively. It can also be seen from Figure 11 that the HEC-HMS flood peak error tends to be negative (i.e., the model tends to underestimate the true value), while that of the hybrid model tends to be positive (i.e., $Q^{p}_{\mathrm{sim}} > Q^{p}_{\mathrm{obs}}$), which is conservative in real-time reservoir operation. Details of the comparisons for all forecasting times and all criteria are summarized in Table 4.
Table 4 reports the improvement or deterioration of the hybrid model relative to the HEC-HMS model, computed by Equation (16) from the median value of each index and its ideal value.

| CONCLUSION
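In this study, a hybrid machine-learning hydrological model based on the combination of the HEC-HMS physical model and the ANN is built and tested for the Krong H'nang hydropower reservoir located in Dak Lak province, Vietnam. The model uses an Encoder-Decoder-LSTM network architecture and is trained, calibrated, and tested with historical rainfall data, measured discharge, and forecasted flow from HEC-HMS based on data of 33 flood events recorded during the period from 2016 to 2021. The results show that the HEC-HMS-ANN hybrid model significantly improves the prediction accuracy when compared to the single HEC-HMS model, especially at longer forecasting times. In detail, the improvement in the KGE index is ΔKGE = 16% at the t + 1 h forecast step and ΔKGE = 69% at the t + 6 h forecast step when using the hybrid model compared to only the HEC-HMS model. In addition, the results also demonstrate that the predictive quality of the hybrid model is superior in other indices, such as the flood peak error and flood volume error.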
Supported by advances in computing technology and large data sets, the hybrid machine-learning hydrological model enhances the accuracy of flood prediction, particularly at longer lead times. Forecasting ahead of time allows state management organizations, building businesses, and people to respond to changing rainfall and flood conditions.

ACKNOWLEDGMENTS
The authors acknowledge Song Ba Joint Stock Company for providing the database and would like to thank the HEC-Support Team for their help during this study. Nguyen Phuoc Sinh was funded by Vingroup JSC and

Note: The data are bold to highlight the increase or decrease of the hybrid model compared to the traditional HEC-HMS model.
Krong H'nang is one of the major hydroelectric projects located in the Ba River basin, which is among one of the largest river catchments in the South-Central region of

FIGURE 1 Map of the Krong H'nang river basin. The main basin is divided into six subbasins (Sub1-6). Color indicates the elevation from low (blue) to high (red). The star markers show the locations of the 15 automatic rain gauges (T1-T15). The green triangle represents the location of the Krong H'nang dam and the automatic water level gauges.

FIGURE 2 Observed basin-average rainfall $P_{\mathrm{ave}}$ (mm) and reservoir inflow $Q_{\mathrm{obs}}$ (m³/s) data of 33 events with a time interval of Δt = 15 min between 2016 and 2021 (data between floods are not displayed). The data set is divided into three parts, including training (train), validation (valid), and testing (test), at a ratio of 70:15:15.

FIGURE 3 Structure of the hybrid hydrological model combining Hydrologic Engineering Center-Hydrologic Modeling System (HEC-HMS) and artificial neural network models.

FIGURE 4 The cross-correlation function (CCF) between the reservoir's inflow and (a) the average precipitation and (b) the reservoir's inflow at lagged time steps.

TABLE 1 Input and output data for the HEC-HMS-ANN hybrid model at different time steps. Columns: Data; Time steps (h).
Notes: The symbol ✓ represents the available data used, and the remaining cells have no data. Abbreviations: ANN, artificial neural network; HEC-HMS, Hydrologic Engineering Center-Hydrologic Modeling System.

FIGURE 5 Illustration of the natural rainfall-runoff process and the flow diagram used in the Hydrologic Engineering Center-Hydrologic Modeling System model (USACE, 2000).

$M_1$ and $M_2$ are the evaluation indicators of the HEC-HMS model and the hybrid model, respectively; $M_i$ is the ideal value of the indicators: KGE = 1; vol error = 0; peak error = 0.

FIGURE 9 Loss versus epochs during the model training process.

calibrated, and tested with historical rainfall data, measured discharge, and forecasted flow from HEC-HMS based on data of 33 flood events recorded during the period from 2016 to 2021. The results show that the HEC-HMS-ANN hybrid model significantly improves the prediction accuracy when compared to the single HEC-HMS model, especially at longer forecasting times.

FIGURE 10 Flood prediction results of the Hydrologic Engineering Center-Hydrologic Modeling System-artificial neural network (HEC-HMS-ANN) hybrid model for flood event #24 at time steps t + 1 to t + 6 h, showing the improvement in forecasting quality compared to the HEC-HMS model. KGE, Kling-Gupta efficiency.

FIGURE 11 Summary and comparison of the forecasting results of the Hydrologic Engineering Center-Hydrologic Modeling System-artificial neural network (HEC-HMS-ANN) hybrid model and the HEC-HMS model at forecasting steps t + 1 to t + 6 h for different evaluation criteria, including KGE, volume error (%), and flood peak error (%). KGE, Kling-Gupta efficiency; NSE, Nash-Sutcliffe efficiency; RRMSE, relative root-mean-square error.

TABLE 4 Calculation of the increase or decrease in the goodness of the hybrid model for 10 flood events (from #24 to #33) at different time steps.