Prediction of the water level at the Kien Giang River based on regression techniques

Model accuracy and runtime are two key issues for flood warnings in rivers. Traditional hydrodynamic models, which have a rigorous physical mechanism for flood routing, have been widely adopted for water level prediction in rivers, lakes, and urban areas. However, these models require various types of data, in-depth domain knowledge, experience with modeling, and intensive computational time, which hinders short-term or real-time prediction. In this paper, we propose a new framework based on machine learning methods to alleviate the aforementioned limitations. We develop a range of machine learning models, namely linear regression (LR), support vector regression (SVR), random forest regression (RFR), multilayer perceptron regression (MLPR), and light gradient boosting machine regression (LGBMR), to predict the hourly water level at the Le Thuy and Kien Giang stations of the Kien Giang River based on data collected in 2010, 2012, and 2020. Four evaluation metrics, that is, R², Nash–Sutcliffe efficiency, mean absolute error, and root mean square error, are employed to examine the reliability of the proposed models. The results show that the LR model outperforms the SVR, RFR, MLPR, and LGBMR models.


| INTRODUCTION
The Kien Giang River, one of two major tributaries within the Nhat Le River system, flows through the Le Thuy and Quang Ninh districts in Quang Binh province (Vietnam) (Figure 1). Spanning approximately 69 km in total length (Ly et al., 2013), this area has been known as a "flood navel" since the formation of the Le Thuy and Quang Ninh topography. During the historical flood in October 2020, more than 50,000 houses at the foothills of the Truong Son mountain range were submerged, and thousands of villages and hamlets were isolated. The flood peak at the Le Thuy station reached 4.88 m, exceeding warning level III and 0.97 m higher than the historical flood peak in 1979.
Accurately forecasting the river water level is critical for early flood warning and flood disaster mitigation. In general, there are two main approaches to predicting the water level. The first relies on physically based models, such as MIKE HYDRO River, HEC-HMS, SOBEK, and EFDC. Although these models have high accuracy, they typically require a variety of datasets, including topographic, meteorological, and hydrological data, and intensive computational time for model simulation. Therefore, physically based models are unsuitable for short-term and real-time prediction. Moreover, the development of a physically based model frequently demands in-depth knowledge and expertise in the hydrological field (Atashi et al., 2022).
An alternative approach is the data-driven model, which analyzes the statistical relationship between input and output data. This approach can help overcome the above-mentioned limitations of physically based models. Machine learning (ML), which has been used for flood forecasting since the 1990s, is one of the most popular frameworks in the data-driven approach. Recent studies suggest that ML can be a powerful tool for flood forecasting because a model can be built quickly and with little effort, without understanding the underlying process. In addition, the other main advantages of ML models are shorter computational time, faster calibration and validation, and easier usage compared with physically based models (Mekanik et al., 2013).
FIGURE 1 The Kien Giang river system and location of meteorological and hydrological stations.
To our knowledge, no previous studies have applied the ML approach to predict river water levels in Quang Binh province. The goal of our study is to apply regression methods, including linear regression (LR), support vector regression (SVR), random forest regression (RFR), multilayer perceptron regression (MLPR), and light gradient boosting machine regression (LGBMR), to predict the water level at the Le Thuy and Kien Giang stations.

| Regression methods
Regression is a statistical method used to analyze the relationship between a quantity to be forecast and historical data. In this study, five machine learning regression techniques are applied to build data-driven models. The primary process when developing these models is the "learning phase," in which the relationship between the input and output variables of the system is established (Guo et al., 2021):
$$y = f(x)$$
with the available data:
$$\{(x_i, y_i)\}_{i=1}^{n}$$
where $x$ is the input vector, $y$ is the output vector, $n$ is the number of observations, and $f$ is the regression function.

| LR
LR is a supervised machine learning algorithm that models a target prediction value based on independent variables. Regression models differ in the kind of relationship they assume between the dependent and independent variables and in the number of independent variables used. The dependent variable of a regression can be referred to as an outcome variable, a criterion variable, an endogenous variable, or a regressand. Correspondingly, the independent variable can be referred to as an exogenous variable, a predictor variable, or a regressor.
In Figure 2a, the input X is the work experience and the output Y is the salary of an individual. In this example, the regression line is the best-fit line for the model. The hypothesis function for LR is as follows:
$$y = \theta_1 + \theta_2 x$$
where $x$ is the input training data (univariate, i.e., one input variable), $y$ is the label of the data (supervised learning), $\theta_1$ is the intercept, and $\theta_2$ is the coefficient of the input training data $x$.
Cost function (J): To achieve the best-fit regression line, the model updates the $\theta_1$ and $\theta_2$ values after each iteration, thereby minimizing the error between the predicted value (pred) and the true value (y).
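The iterative update of θ1 and θ2 can be illustrated with a minimal sketch. It assumes that the cost function is the mean squared error and that plain batch gradient descent is used; neither choice is stated in the text, and the function name, learning rate, and toy data are hypothetical.

```python
def fit_linear_regression(xs, ys, lr=0.05, n_iter=5000):
    """Fit y = theta1 + theta2 * x by batch gradient descent.

    Assumption: the cost J is the mean squared error between the
    predicted value (pred) and the true value (y).
    """
    theta1, theta2 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        preds = [theta1 + theta2 * x for x in xs]
        errors = [p - y for p, y in zip(preds, ys)]
        # Gradients of J with respect to the intercept and the slope
        grad1 = 2.0 / n * sum(errors)
        grad2 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        theta1 -= lr * grad1
        theta2 -= lr * grad2
    return theta1, theta2


# Toy data generated from y = 1 + 2x (illustrative only)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta1, theta2 = fit_linear_regression(xs, ys)
```

On this noise-free toy data the estimates approach the generating intercept and slope. For multivariate inputs such as lagged rainfall and water levels, a closed-form least-squares solver is the more common choice.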

| SVR
The SVR approach proposed by Drucker et al. (1996) was employed herein for nonlinear regression. The regression function of SVR can be expressed as follows (Liong & Sivapragasam, 2002; Yu et al., 2006):
$$f(x) = w \cdot \varphi(x) + b \tag{4}$$
where $w$ is the weight vector, $\varphi$ is the nonlinear mapping function, and $b$ is the bias term. According to the fundamental concept of structural risk minimization to prevent overfitting, Equation (4) can be further expressed as follows:
$$\min_{w,\, b,\, \xi,\, \xi^*} \ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \left(\xi_i + \xi_i^*\right)$$
$$\text{subject to} \quad y_i - f(x_i) \le \varepsilon + \xi_i, \quad f(x_i) - y_i \le \varepsilon + \xi_i^*, \quad \xi_i,\ \xi_i^* \ge 0$$
where $C$ denotes the cost parameter or penalty parameter, $\xi$ and $\xi^*$ are nonnegative slack variables, and $\varepsilon$ is the parameter of the insensitive loss function. On the basis of Lagrange multipliers, the optimization problem of SVR can be written in dual form (Wu et al., 2008):
$$f(x) = \sum_{i=1}^{n} \left(\alpha_i - \alpha_i^*\right) K(x_i, x) + b$$
where $\alpha$ and $\alpha^*$ are Lagrange multipliers and $K$ is the kernel function. In this study, the commonly used radial basis function was employed as the kernel function. Detailed descriptions of the SVR methodology can be found in the literature (Brereton & Lloyd, 2010; Chang & Lin, 2011).
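The ingredients named above, that is, the radial basis function kernel, the insensitive loss, and the kernel expansion of the regression function, can be sketched in a few lines. This is an illustration under assumptions and not the solver used in the study: the function names, the kernel width `gamma`, and the default `eps` are hypothetical, and the dual coefficients are taken as given rather than obtained by optimization.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Errors smaller than eps cost nothing; larger errors grow linearly."""
    return max(0.0, abs(y_true - y_pred) - eps)


def svr_predict(x, support_vectors, dual_coefs, b, gamma=0.5):
    """Evaluate f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b.

    dual_coefs[i] stands for the difference (alpha_i - alpha_i*), which
    would normally come from solving the dual problem (assumed given).
    """
    return sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    ) + b
```

In practice, a library implementation would handle the optimization of the multipliers; the sketch only shows how a fitted model produces a prediction.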

| RFR
The RFR approach proposed by Breiman (2001) is a tree-based ensemble ML technique based on the combination of bagging (bootstrap aggregation) and the random subspace method. During training, the binary recursive partitioning of the classification and regression tree is used to build each decision tree. Once a forest of trees has been constructed, the predictions from the trees are aggregated as the final result. The advantages of the RFR approach are its simplicity and the low number of tuning hyperparameters. The RFR algorithm, as shown in Figure 2b, is summarized as follows (Choi et al., 2019; Li et al., 2016; Muñoz et al., 2018; Nguyen et al., 2015):
1. On the basis of the bootstrap method, a subset of samples is randomly drawn with replacement from the original data set.
2. These bootstrap samples are employed to construct regression trees. The optimal split criterion is used to split each node of the regression trees into two descendant nodes. The process on each descendant node is continued recursively until a termination criterion is fulfilled.
3. Each regression tree provides a predicted result. Once all of the regression trees have reached their maximum size, the final prediction is determined as the average of the results from all of the regression trees:
$$\hat{y}^{RFR} = \frac{1}{N_{tree}} \sum_{tr=1}^{N_{tree}} \hat{h}_{tr}(x)$$
where $tr$ is the index of a regression tree, $N_{tree}$ is the total number of trees, and $\hat{h}_{tr}$ denotes the prediction of each regression tree. Detailed descriptions of RFR have been provided in previous studies (Biau & Scornet, 2015; Boulesteix et al., 2012).

| MLPR
MLPR, which is a type of feed-forward neural network, includes three layers: input, hidden, and output layers (Figure 2c). The neural network in MLPR consists of neurons, biases assigned to neurons, connections among neurons, and weights on those connections. Mathematically, the regression function of MLPR can be expressed as follows (Chen et al., 2020; Khan & Coulibaly, 2006):
$$\hat{y}_r^{MLPR} = c_r + \sum_{q} u_{qr}\, a_q(x) \tag{10}$$
where $c_r$ denotes the bias of the rth output neuron, $u_{qr}$ is the weight connecting the qth neuron in the hidden layer to the rth neuron in the output layer, and $a_q(x)$ represents the output of the qth hidden neuron, which can be expressed in terms of the activation function $F$:
$$a_q(x) = F\left(d_q + \sum_{p} v_{pq}\, x_p\right)$$
where $d_q$ is the bias of the qth hidden neuron, $x_p$ is the input variable, and $v_{pq}$ is the weight connecting the pth neuron in the input layer to the qth neuron in the hidden layer.
Several types of activation functions can be employed, including linear, sigmoid, hyperbolic tangent (tanh), and rectified linear unit (ReLU) functions. In the training process of MLPR, the back-propagation algorithm is used to adjust the weights connecting neurons so as to minimize errors (Jhong et al., 2018; Rumelhart et al., 1986). Details regarding the theory of MLPR have been provided in previous studies (Govindaraju & Rao, 2000; Hagan et al., 1996; Haykin, 1998).
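The forward pass of a single-hidden-layer MLPR can be written out directly from the quantities defined above: each hidden neuron applies the activation function to its bias plus the weighted inputs, and each output neuron adds its bias to the weighted hidden outputs. The sketch below is illustrative only; the function name, the nested-list layout of the weights, and the default tanh activation are assumptions, and training by back-propagation is not shown.

```python
import math


def mlp_forward(x, v, d, u, c, activation=math.tanh):
    """Forward pass of a one-hidden-layer MLP.

    x: inputs x_p
    v: v[p][q], weight from input neuron p to hidden neuron q
    d: d[q], bias of hidden neuron q
    u: u[q][r], weight from hidden neuron q to output neuron r
    c: c[r], bias of output neuron r
    """
    # Hidden layer: a_q(x) = F(d_q + sum_p v_pq * x_p)
    hidden = [
        activation(d[q] + sum(v[p][q] * x[p] for p in range(len(x))))
        for q in range(len(d))
    ]
    # Output layer: y_r = c_r + sum_q u_qr * a_q(x)
    return [
        c[r] + sum(u[q][r] * hidden[q] for q in range(len(hidden)))
        for r in range(len(c))
    ]
```

Passing `activation=lambda z: z` gives the linear activation mentioned in the text, which makes the network output easy to check by hand.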

| LGBMR
LGBMR uses four main algorithms to improve computational efficiency and prevent overfitting: gradient-based one-side sampling (GOSS), exclusive feature bundling (EFB), a histogram-based algorithm, and a leaf-wise growth algorithm (He et al., 2022; Ke et al., 2017). As shown in Figure 2d, the leaf-wise growth algorithm allows the identification of the leaf node with the largest split gain while preventing overfitting. In addition, LGBMR adopts the histogram-based decision tree algorithm to divide continuous floating-point features into a number of intervals, which reduces the computational power required for prediction. Moreover, GOSS and EFB are used to reduce the number of samples and features, respectively, to accelerate the training process of LGBMR.
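The objective function of LGBMR can be written as follows:
$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t)}\right) + \Omega(f_t) \tag{12}$$
where $l$ is the loss function, $\Omega$ is the regularization term of the decision tree $f_t$ at the tth iteration, $y_i$ is the true (objective) value, and $\hat{y}_i$ is the predicted value. On the basis of the boosting algorithm, Equation (12) can be further expressed as follows:
$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{13}$$
where $\hat{y}_i^{(t-1)}$ is the predicted value of the model at step $t-1$ and $f_t(x_i)$ denotes the new predicted value at the tth step. To solve the objective function, the Newton method is employed to simplify Equation (13) into the following equation:
$$Obj^{(t)} \cong \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \tag{14}$$
where $g_i$ and $h_i$ are, respectively, the first and second derivatives of the loss function, which can be expressed as follows:
$$g_i = \frac{\partial\, l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}}, \qquad h_i = \frac{\partial^2 l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2}$$
Samples in the regression trees are assigned to leaf nodes. The final value of the loss can be determined from the accumulation of the loss values of the leaf nodes. Thus, with the use of $I_j$ to represent the set of samples in leaf $j$, Equation (14) can be rewritten as
$$Obj^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right]$$
where $T$ is the total number of leaf nodes, $w_j$ is the weight of leaf node $j$, and $\lambda$ is the regularization parameter. Thus, the optimal objective function can be obtained by minimizing this quadratic function. Detailed descriptions of LGBMR have been provided in previous studies (He et al., 2022; Ke et al., 2017; Tang et al., 2020).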
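Minimizing the quadratic function for a single leaf has a closed form: with G and H the sums of the first and second derivatives over the samples in the leaf, the optimal weight is −G / (H + λ). The sketch below works this out for one leaf. It assumes a squared-error loss l = ½(y − ŷ)², for which g = ŷ − y and h = 1; the loss choice, the function names, and the toy values are assumptions rather than details taken from the study.

```python
def squared_loss_derivatives(y_true, y_prev):
    """First and second derivatives of l = 0.5 * (y - y_hat)^2 (assumed loss)."""
    g = [p - y for y, p in zip(y_true, y_prev)]
    h = [1.0] * len(y_true)
    return g, h


def optimal_leaf_weight(g, h, lam=1.0):
    """Minimizer of G*w + 0.5*(H + lam)*w^2 for the samples in one leaf."""
    G, H = sum(g), sum(h)
    return -G / (H + lam)


def leaf_objective(g, h, w, lam=1.0):
    """Contribution of one leaf with weight w to the quadratic objective."""
    G, H = sum(g), sum(h)
    return G * w + 0.5 * (H + lam) * w * w


# Three samples falling in the same leaf; previous predictions are all zero
g, h = squared_loss_derivatives([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
w_star = optimal_leaf_weight(g, h, lam=1.0)
obj_star = leaf_objective(g, h, w_star, lam=1.0)
```

Here G = −6 and H = 3, so the leaf weight is 1.5 rather than the sample mean of 2: the regularization parameter λ shrinks the weight toward zero.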

| Evaluation metrics
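To quantitatively evaluate the performance of the five models, the following evaluation metrics are employed:
• Coefficient of determination (R²)
$$R^2 = \left[ \frac{\sum_{i=1}^{n} \left(H_i^{obs} - \bar{H}^{obs}\right)\left(H_i^{pre} - \bar{H}^{pre}\right)}{\sqrt{\sum_{i=1}^{n} \left(H_i^{obs} - \bar{H}^{obs}\right)^2}\ \sqrt{\sum_{i=1}^{n} \left(H_i^{pre} - \bar{H}^{pre}\right)^2}} \right]^2$$
• Nash–Sutcliffe efficiency (NSE)
$$NSE = 1 - \frac{\sum_{i=1}^{n} \left(H_i^{obs} - H_i^{pre}\right)^2}{\sum_{i=1}^{n} \left(H_i^{obs} - \bar{H}^{obs}\right)^2}$$
• Mean absolute error (MAE)
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| H_i^{obs} - H_i^{pre} \right|$$
• Root mean square error (RMSE)
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(H_i^{obs} - H_i^{pre}\right)^2}$$
where $H_i^{obs}$ and $H_i^{pre}$ are the observed and predicted water levels, respectively; $\bar{H}^{obs}$ and $\bar{H}^{pre}$ are the means of the observed and predicted water levels, respectively; and $n$ is the number of observations. R² ranges from 0 to 1; a value of 1 indicates that the predicted values are equal to the observed values. The NSE ranges from −∞ to 1; the closer the NSE is to 1, the better the model result. RMSE and MAE are error metrics with an optimal value of zero.
Two additional metrics are considered, namely the peak error (PE) and the error of time to peak (Δt). These two metrics are critical in flood warning because they evaluate the model's capacity to forecast the peak water level in terms of magnitude and timing. PE compares the observed and predicted peak water levels, $H_p^{obs}$ and $H_p^{pre}$, and Δt compares the observed and predicted times to peak, $T_p^{obs}$ and $T_p^{pre}$. The closer PE and Δt are to zero, the better the model performs.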
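The metrics described in this section can be computed with a few short functions. The sketch below follows the standard definitions of MAE, RMSE, NSE, and R² (taken here as the squared correlation between observed and predicted series, since the means of both series are used). The exact forms of PE and Δt are not recoverable from the text, so the signed differences used here (predicted minus observed peak, and the difference in time-step index of the peaks) are assumptions, as are all function names and the toy series.

```python
import math


def mae(obs, pre):
    return sum(abs(o - p) for o, p in zip(obs, pre)) / len(obs)


def rmse(obs, pre):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pre)) / len(obs))


def nse(obs, pre):
    mean_obs = sum(obs) / len(obs)
    num = sum((o - p) ** 2 for o, p in zip(obs, pre))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den


def r2(obs, pre):
    mean_obs = sum(obs) / len(obs)
    mean_pre = sum(pre) / len(pre)
    cov = sum((o - mean_obs) * (p - mean_pre) for o, p in zip(obs, pre))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_pre = sum((p - mean_pre) ** 2 for p in pre)
    return cov ** 2 / (var_obs * var_pre)


def peak_error(obs, pre):
    # Assumption: signed difference of peak levels, in the units of the data
    return max(pre) - max(obs)


def time_to_peak_error(obs, pre):
    # Assumption: difference in time steps between predicted and observed peak
    return pre.index(max(pre)) - obs.index(max(obs))


obs = [1.0, 2.0, 4.0, 3.0]  # toy observed water levels
pre = [1.0, 2.0, 3.0, 4.0]  # toy predicted water levels
scores = {
    "MAE": mae(obs, pre),
    "RMSE": rmse(obs, pre),
    "NSE": nse(obs, pre),
    "R2": r2(obs, pre),
    "PE": peak_error(obs, pre),
    "dt": time_to_peak_error(obs, pre),
}
```

In the toy series the peak value is reproduced exactly but one time step late, which the bulk metrics alone would not reveal; this is why peak-oriented metrics are reported separately for flood warning.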

| RESULTS AND DISCUSSION
In this study, we predict water levels at the Le Thuy and Kien Giang stations using five machine learning regression models. We employ six evaluation metrics (R², NSE, MAE, RMSE, PE, and Δt) to assess the models' performance. Accordingly, the ideal input data set parameters and the most effective machine learning strategy for this problem are selected. The machine learning regression models are implemented in the Python 3.7 environment.

| Study area and data
The Nhat Le River basin, which has a total area of 2612 km², includes three main tributaries: Kien Giang, Long Dai, and Nhat Le (Figure 1). Three medium-sized irrigation reservoirs, An Ma, Cam Ly, and Rao Da, have small flood capacities of 22.1, 6.9, and 11.6 million m³, respectively. Other reservoirs have smaller capacities, so the impact of their operation on downstream water levels is negligible.
In this study, we use hourly rainfall and water level data collected from three stations, namely Kien Giang, Le Thuy, and Dong Hoi. The data set contains the flood season of 2010, as well as the whole years of 2012 and 2020. We split the data set into two parts: training data from 2010 and 2012 (10,297 samples) and testing data from 2020 (8785 samples).
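The split described above is chronological rather than random: the model is fitted on earlier years and tested on a later, unseen year. A minimal sketch of such a split is given below; the record layout (a dictionary with a `year` key) and the function name are hypothetical and do not reflect the actual data format.

```python
def split_by_year(records, train_years=(2010, 2012), test_years=(2020,)):
    """Split hourly records into training and testing sets by calendar year.

    Each record is assumed to be a dict with at least a "year" key.
    No shuffling is applied, so the temporal order is preserved.
    """
    train = [r for r in records if r["year"] in train_years]
    test = [r for r in records if r["year"] in test_years]
    return train, test


# Toy records (illustrative values only)
records = [
    {"year": 2010, "level": 1.2},
    {"year": 2012, "level": 0.8},
    {"year": 2020, "level": 2.5},
]
train, test = split_by_year(records)
```

Keeping the test year entirely separate avoids leaking information from the 2020 flood into training, which a random split of hourly samples would do.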

| Input data set to predict water level at Le Thuy and Kien Giang stations
We use the data set of hourly water levels and rainfall recorded at three stations (Kien Giang, Le Thuy, and Dong Hoi) in 2010, 2012, and 2020. Previous research and analysis show that the water level at a station is influenced by previous rainfall and water levels at that station and at nearby stations. One problem is determining the length of past input data that yields the highest performance. Because this study employs hourly data, we can predict the water levels 1, 6, and 12 h ahead. To improve accuracy, we experiment with integrating predicted rainfall data from hydro-meteorological stations for the following 1, 6, and 12 h as additional features. The following sections present the findings of these experiments.

Choosing the number of time lags and time leads
To determine the optimal length of historical data, we configure the water level forecasting model for the Le Thuy and Kien Giang stations as follows:
$$\hat{H}_{t+h}^{LT},\ \hat{H}_{t+h}^{KG} = f\left(R_{t-k}, \ldots, R_{t}, \ldots, R_{t+h},\ H_{t-k}, \ldots, H_{t}\right)$$
where $\hat{H}_{t+h}^{LT}$ and $\hat{H}_{t+h}^{KG}$ are the forecast water levels at the Le Thuy and Kien Giang stations, respectively; $t$ is the present time; $t + h$ is the predicted time (h = 1, 6, or 12 h); $R_{t-k}, \ldots, R_{t+h}$ are the rainfalls at the three stations in the past, at present, and in the future (predicted rainfall); $k$ is the number of time lags; and $H_{t-k}, \ldots, H_{t}$ are the water levels at the three stations (Le Thuy, Kien Giang, and Dong Hoi) in the past and at present.
With the LR model, the number of time lags is varied between 1 and 10 h (k = 1-10) to predict the water level 6 h ahead (h = 6 h) at the Le Thuy station (Figure 3). The results show that using data from the previous 2 to 10 h to predict the water level for the next 6 h yields good results, with R² and NSE both above 0.99. In particular, at k = 3, all four metrics show the best performance, with MAE and RMSE of 2.35 and 0.25 cm, respectively. In addition, the peak error in the 2020 flood, Ep, is 0.7 cm when k = 3. Therefore, we select the input data set with four hourly time steps (at times t, t − 1, t − 2, and t − 3) to predict the water level at lead times of 1, 6, and 12 h (h = 1, 6, 12) in the following section.
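The lagged input set can be assembled by sliding a window over the hourly series: for each time t, the values at t − k, ..., t from every rainfall and water level series form one feature row, and the water level at t + h is the target. The sketch below illustrates this for k = 3 and h = 6. The function name and the dictionary-of-lists layout are hypothetical, and the predicted future rainfall used in the study is omitted for brevity.

```python
def build_lagged_samples(rain, level, target, k=3, h=6):
    """Build (features, target) pairs from hourly series.

    rain, level: dicts mapping a station name to a list of hourly values
    target: list of hourly water levels at the station to be forecast
    k: number of time lags; h: lead time in hours
    Assumption: future (predicted) rainfall features are not included.
    """
    n = len(target)
    X, y = [], []
    for t in range(k, n - h):
        row = []
        for series in list(rain.values()) + list(level.values()):
            row.extend(series[t - k : t + 1])  # values at t-k, ..., t
        X.append(row)
        y.append(target[t + h])  # water level h hours ahead
    return X, y


# Toy series of 20 hourly values at one station (illustrative only)
rain = {"A": [float(i) for i in range(20)]}
level = {"A": [10.0 * i for i in range(20)]}
X, y = build_lagged_samples(rain, level, level["A"], k=3, h=6)
```

Each row holds (k + 1) values per series, so with three stations and two variables per station a row would contain 6 × (k + 1) features before any predicted rainfall is appended.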

| Prediction results of water levels in Kien Giang and Le Thuy
In this experiment, the water level at the Le Thuy station was predicted using five regression models: LR, RFR, SVR, MLPR, and LGBMR. The scatter plot (Figure 4) compares the models' water level predictions with a 6 h lead time. We can observe that the MLPR model underperforms compared with the LR, RFR, and LGBMR models, while the SVR model performs the worst, as its points are distributed furthest from the 1:1 line. When the water level is below warning level III (H = 2.7 m), the RFR and LGBMR models produce results close to the 1:1 line. However, the forecast values of the RFR and LGBMR models are also lower than the observed values, as these models fail to estimate the flood's peak water level.
FIGURE 3 Evaluation performances of the linear regression models. MAE, mean absolute error; NSE, Nash-Sutcliffe efficiency; RMSE, root mean square error; and R².
FIGURE 4 Scatter plot of forecast and observed water level at the Le Thuy station from five regression models. LGBMR, light gradient boosting machine regression; LR, linear regression; MLPR, multilayer perceptron regression; RFR, random forest regression; SVR, support vector regression.
Tables 1 and 2 show the detailed assessments of the models' capacity to reproduce water levels at the Le Thuy and Kien Giang stations, respectively. Observed water level and rainfall data in 2010 and 2012, together with predicted rainfall, at three locations (Kien Giang, Le Thuy, and Dong Hoi) are used as training data. The model's output is the water level at Le Thuy and Kien Giang for the following 1, 6, and 12 h. The 2020 data set was used to test the model. The SVR model performs poorly, while the LR, RFR, and LGBMR models give good results in terms of NSE, R², MAE, and RMSE.
Forecasting is crucial for flood disaster mitigation and prevention. One requirement for a valid model is the accurate reproduction of flood peaks and time to peak. The results demonstrate that the LR model gives the best result, with NSE and R² greater than 0.99, RMSE of 0.007 and MAE of 0.11, a flood peak error of less than 8% (equivalent to 39.3 cm for h = 12 h at the Le Thuy station) and 2% (i.e., 2.87 cm for h = 12 h at the Kien Giang station), and Δt = 1 h. First, the results of the LR model are shown in Tables 1 and 2. Second, the relationship between rainfall and the water level is linear. Third, the water levels at the stations (Le Thuy, Kien Giang, and Dong Hoi) influence one another linearly.
Figures 5 and 6 show the measured water level and the water level predicted by the LR model with a 6 h lead time during the historic flood from October 6, 2020, to October 21, 2020, at the two stations (Le Thuy and Kien Giang). We can observe that the predicted and measured lines are very close to one another. However, the results of the linear regression model are still unreliable, particularly at the flood peak. The choice of input data, such as the water level at Dong Hoi, may be to blame for this. Because high tides affect the Dong Hoi water level, its water level curve changes occasionally. As a result of training on these data, the predicted water level at Le Thuy tends to fluctuate similarly to the water level at Dong Hoi, even though the water level at Le Thuy may be mostly unaffected by high tides. To further improve prediction accuracy, future research should take other variables into account, such as the geography of the location, ground cover, and initial moisture conditions.

| CONCLUSIONS
In this study, we have developed models to predict the water level at the Le Thuy and Kien Giang stations on the Kien Giang River using the LR, RFR, SVR, MLPR, and LGBMR regression techniques. The models are trained and validated using the hourly rainfall and water level datasets at three stations in 2010, 2012, and 2020. The study's findings indicate that the best water level prediction at the Le Thuy station is obtained 6 h ahead when using input data of hourly rainfall and water level in the period [t − 3, t], as well as the predicted rainfall at the three hydrometeorological stations in the area. The model can forecast both the trough and the peak of the water level. The evaluation metrics R², NSE, MAE, and RMSE show that the data-driven model is feasible and reliable for predicting water levels, with LR outperforming the other four regression methods.
In future studies, we will consider the addition of other inputs such as flows, tidal levels, rainfall at nearby stations, or forecasted rainfall from the hydro-meteorological center stations (instead of using actual rainfall data measured at time step t + 1), and will construct water level forecasting models for hydrological stations and other "virtual" stations along the Kien Giang and Nhat Le rivers. In addition, other machine learning techniques, such as deep learning algorithms, will also be explored to enhance the accuracy of future water level forecasts.

FIGURE 2 The conceptual diagram for (a) linear regression, (b) support vector regression, (c) random forest regression, (d) multilayer perceptron regression, and (e) light gradient boosting machine regression.

FIGURE 5 Comparison of forecast and observed water level 6 h ahead at the Le Thuy station using the linear regression model.
FIGURE 6 Comparison of forecast and observed water level 6 h ahead at the Kien Giang station using the linear regression model.
TABLE 1 Evaluation metrics of LR, RFR, SVR, MLPR, and LGBMR models at the Le Thuy station.
Note: Bold indicates the best result, which is given by the LR model. Abbreviations: LGBMR, light gradient boosting machine regression; LR, linear regression; MLPR, multilayer perceptron regression; RFR, random forest regression; SVR, support vector regression.
TABLE 2 Evaluation metrics of LR, RFR, SVR, MLPR, and LGBMR models at the Kien Giang station.
Note: Bold indicates the best result, which is given by the LR model. Abbreviations: LGBMR, light gradient boosting machine regression; LR, linear regression; MLPR, multilayer perceptron regression; RFR, random forest regression; SVR, support vector regression.