Basin‐scale monthly rainfall forecasts with a data‐driven model using lagged global climate indices and future predicted rainfall of an adjacent basin

Future long‐term rainfall forecasts are valuable for operating water supply facilities and managing unusual droughts. This study proposes a novel approach to forecast basin‐scale monthly rainfall from lagged global climate indices, antecedent historical rainfall data of a targeted basin, and forecasted rainfall data from a nearby basin using a data‐driven model. The approach is applied to the Han River basin and the Geum River basin, South Korea, for May and June, prone to drought occurrence. An artificial neural network (ANN), a widely used data‐driven model, was employed to build forecasting models for the study basins. Two types of ANN models were constructed: one uses predictors of the lagged climate indices and antecedent rainfall of a targeted basin that have been typically used in previous studies, and the other further uses the forecasted rainfall of an adjacent basin that was first attempted in this study by considering the strong concurrent relationship of monthly rainfall data between nearby basins. The optimal network architectures were determined through the Monte Carlo cross‐validation (MCCV) process in which repeated data subsampling for training datasets was carried out to reduce the output variance and obtain ensemble forecasts. The results show that the proposed ANN model in this study with input variables of the forecasted rainfall of the nearby basin and the lagged global climate indices and the past rainfall of the target basin provides better predictive performance than the model without using the adjacent basin's forecast rainfall. The categorical forecasting skill based on the proposed approach is good: the hit rates and Heidke skill scores ranged from 50.9 to 66.0% and 0.29 to 0.49, respectively. The results confirm that using rainfall forecasts of a nearby basin as an input variable can enhance the ANN model's ability to predict future monthly rainfall.


| INTRODUCTION
Rainfall forecasting is of great importance for managing water resource systems. Timely and reliable rainfall forecasting is necessary to mitigate the adverse impacts of floods and droughts, prepare risk management strategies, plan effective use of available water resources, and schedule reservoir and irrigation systems. In particular, forecasting future rainfall for monthly or more extended periods can help water resource managers to adjust water allocation, control water consumption, and make plans for new water supply facilities in preparation for unexpected severe droughts.
One approach to forecasting long-term rainfall is to use machine learning techniques with large-scale climate indices as predictors. It is widely understood that distant hydroclimatic teleconnections between rainfall patterns and large-scale climate signals can serve as a basis for forecasting models. Many studies worldwide have shown that the ANN model, one of the most widely used machine learning techniques, is a practical tool for long-term rainfall forecasting when large-scale climate indices are used as inputs.
In Australia, dominant climate indices such as the El Niño-Southern Oscillation (ENSO) and the Indian Ocean Dipole are believed to affect rainfall variability (Chiew et al., 1998; Cai et al., 2001). Abbot and Marohasy (2012) applied artificial neural networks (ANNs) to monthly and seasonal rainfall forecasting in Queensland, Australia, with past rainfall and temperature data and several climate indices as inputs. They achieved lower forecasting errors than the results provided by the Australian Bureau of Meteorology's Predictive Ocean Atmosphere Model for Australia (POAMA). They also suggested that forecasting skill can potentially be improved by including any forecast climate index, such as the Southern Oscillation Index (SOI), an output from POAMA, in the input attributes of neural networks. Subsequently, Abbot and Marohasy (2015) demonstrated improved monthly rainfall forecasts for the Brisbane catchment, Queensland, Australia, by incorporating both lagged and forecast values of climate indices as inputs. In their study, the forecast input values of climate indices were also independently provided by ANNs with the same lead time as the output value of rainfall. Abbot and Marohasy (2017) used the lagged monthly data of rainfall, temperature, and climate indices to develop ANN-based long-term monthly rainfall forecasting models with a lead time of 12 months for three sites within the Murray Darling Basin in southeastern Australia. For each site, they built two distinct types of ANNs to forecast rainfall for each month: one was optimized for all 12 months, while the other was optimized for each month separately, and then they combined the most skilful forecasts from both ANN models for each calendar month. Mekanik et al. (2013) applied the ANN model to forecast spring rainfall in Victoria, Australia, using the lagged climate indices of Niño3.4, SOI, and Dipole Mode Index (DMI) as predictors.
They showed that the generalization ability of the ANN model for out-of-sample test sets was reasonable compared to the multiple regression model for most stations tested in the study. Ghamariadyan et al. (2019) presented a hybrid wavelet neural network (HWNN) model for forecasting monthly rainfall with lead times of 1, 3, and 6 months using antecedent rainfall, temperature, and climate indices of the Niño index and SOI. When HWNN was applied to the central interior of Queensland, Australia, the wavelet transform improved the forecasting skill, mainly by giving better predictions of extreme rainfall than the traditional ANN model. Several ANN models have been developed to predict long-term rainfall in Africa. Badr et al. (2014) applied ANN modelling to predict summer rainfall anomalies over the Sahel region in Africa using the climate indices of spring sea surface temperature (SST) and surface air temperature (SAT) anomalies. They achieved the best-performing ANN model with nine principal components (PCs) of candidate climate indices, which showed better predictive accuracy than the other eight statistical models applied in the study. Bello and Mamman (2018) applied ANN to predict monthly rainfall over Kano, Nigeria, using lagged climate indices of ENSO-related drivers and showed superior prediction results relative to a linear model.
Large-scale climate indices were also applied to monthly and seasonal rainfall forecasting in monsoon climate regions. Kumar et al. (2007) successfully employed ANNs to forecast the summer monsoon rainfall over the Odisha state, India, for July to September on monthly and seasonal scales. They used several monthly lagged global climate indices and a temperature-related local climate index as input predictors for the ANN model. The predictors with different lag times were selected based on the correlation analysis, and the model structure was optimized using a genetic optimizer. Hartmann et al. (2008) developed ANNs to forecast summer monsoon rainfall from May to September in the Yangtze River basin, China, using several teleconnection climate indices and other indices calculated from sea surface temperatures, sea level pressures, and snow data from December to April. The most important input predictors among the climate indices considered in their study were identified through sensitivity analyses for each input variable. For the summer rainfall prediction, the most accurate result of five model runs was finally selected by varying the random setting of the initial weights five times. Yuan et al. (2016) investigated the relationship between summer rainfall from June to September in the source region of the Yellow River, China, and global teleconnection climate indices through principal component analysis and singular value decomposition, and then constructed ANNs to predict the summer rainfall using the identified lagged climate indices. Lee et al. (2018) developed an ANN-based forecasting model using several lagged climate indices to predict cumulative rainfall in May and June in the Geum River basin, South Korea. Using Garson's and Olden's connection weights methods (Garson, 1991;Olden et al., 2004), they quantified each input variable's relative importance (RI) to construct the most straightforward and best-performing ANN structure. 
In their study, K-fold cross-validation (CV) was conducted to select the best model architecture while avoiding over-fitting to any specific training dataset, as well as to evaluate the predictive performance. The results showed that the model performance was acceptable for the May and June rainfall predictions, with relative RMSE values below 35% and hit scores above 60%; however, the model performance needed to be improved for predicting extremely low rainfall conditions. Lee et al. (2020) constructed ANN models to forecast basin-scale monthly rainfall amounts in the Han River basin, South Korea, for May and June, when agricultural water consumption is very high. To forecast several months in advance, the antecedent global climate indices and historical rainfall records on a monthly time scale were used as inputs to the model. In particular, the Monte-Carlo cross-validation (MCCV) procedure was applied to create diverse network models through repeated random data sampling for the training, validation, and test datasets, both to avoid high variance and to validate the model performance for an ensemble of forecasts with an uncertainty band. The ANN modelling resulted in satisfactory predictive performance, showing that the predicted uncertainty band bracketed many observation points well. However, the authors emphasized that further improvement is required to capture extraordinarily high- and low-rainfall characteristics.
While developing long-term rainfall forecasting models, previous studies mentioned above made special efforts to improve the predictive performance by incorporating input selection methods, preprocessing techniques, and ensemble methods. Sensitivity analysis (Hartmann et al., 2008), correlation analysis (Kumar et al., 2007), and quantification of relative variable importance (Lee et al., 2018;Lee et al., 2020) were carried out to identify significantly correlated input variables. Additional local climate data or forecasted values of climate indices were used as inputs for the ANN model (Abbot and Marohasy, 2015). Data preprocessing procedures of principal component analysis (Badr et al., 2014;Yuan et al., 2016) and wavelet transformation (Ghamariadyan et al., 2019) were applied to find significant predictors that influence the output of the model. K-fold CV (Lee et al., 2018) and MCCV (Lee et al., 2020) methods were applied to obtain ensemble forecasts and reduce the uncertainty resulting from random data subsampling for training networks. As an extension of improvement efforts, our study has attempted to bridge the methods proposed by Abbot and Marohasy (2015) and Lee et al. (2020), in which the forecast climate indices were used as additional input predictors to improve the prediction accuracy and the MCCV was performed to reduce the variance of predictions. Abbot and Marohasy (2015) considered the concurrent relationships between the target rainfall amount and the forecast values of several climate indices. The forecast models in which the forecast climate indices were additionally used as inputs showed RMSE values in the range of 44.8-52.9 mm for the monthly precipitation forecast in the study area, improving predictive performance compared to the RMSE values of 47.2-59.9 mm obtained without using the forecast climate indices. 
However, unlike the situation in Australia, there have been no well-known simultaneous global climate indices that significantly correlate with monthly time-scale precipitation in South Korea. Instead, particular attention was focused on the concurrent relationship of rainfall amounts between a targeted basin of interest and an adjacent basin in our study. The idea of this study stems from the fact that there is likely to be a strong correlation of rainfall between two adjacent basins. Therefore, this study investigates the effect of lead rainfall in a nearby basin as an additional input predictor of the forecasting model.
This study proposes an approach to improve the rainfall forecast quality of an ANN model by using future forecasted rainfall data of a nearby basin together with lagged climate indices and past rainfall data of a targeted basin. The approach is applied to forecast monthly rainfall in May and June a few months in advance for the Han and Geum River basins. The MCCV method is used to generate ensemble forecasts that account for the uncertainty from diverse training datasets and to evaluate the model performance. The improved prediction performance is highlighted by comparison with the results obtained without using the forecasted rainfall of the nearby basin.

| Study area
The Korean Peninsula is geographically located on the east coast of the Eurasian continent and is adjacent to the western Pacific. It is therefore highly controlled by the East Asian monsoon, formed by the thermal differences between the Asian continent and the Pacific Ocean, which causes seasonal changes in the direction of the prevailing winds with distinct wet and dry seasons. Influenced by the East Asian monsoon, Korea has distinct hot summer and cold winter with more than half of the annual precipitation falling in the summer months of June, July, and August. Due to uneven precipitation throughout the year, a number of water resource facilities such as multipurpose dams and irrigation reservoirs have been built and operated to stably supply water during the dry season. The northern and western regions in South Korea experienced an extreme drought in 2014-2016, resulting in water storage in most dams and irrigation reservoirs in the Han River and Geum River basins reaching record lows (Hong et al., 2016;Jung et al., 2020).
One of the study areas, the Han River basin, is located in the central region of the Korean Peninsula. This basin encompasses Seoul, the capital of South Korea, as well as Gyeonggi and Gangwon provinces, where nearly 50% of the country's total population lives, consuming a large amount of water. The other study area, the Geum River basin adjacent to the Han River basin, serves as a lifeline for the central and southern regions of Chungcheong province, and the downstream area of the basin has large alluvial plains and requires a lot of water for agriculture. The two adjacent study basins have similar climatic conditions. Figure 1 shows the geographic locations of the study area and the rainfall stations used to calculate basin-wide monthly rainfall data. The Ministry of Environment, Korea provides basin-scale area-average rainfall data calculated using the Thiessen method based on rainfall data at measurement points within and around the basin. The red circles in the figure represent the overlapping rainfall stations used for the basin-wide monthly rainfall data of both basins. As shown in the figure, there are few overlapping rainfall stations. This means that the basin-wide rainfall data of the two basins can be treated as independent. Figure 2 shows the correlation coefficients between the two basins calculated from the monthly precipitation data for the last 53 years. The high correlation coefficients of 0.58-0.88 mean that if the rainfall in a nearby basin can be predicted, it can be used as an input variable for a rainfall forecasting model for the target basin.

| Data
Historical basin-scale precipitation data from 1966 to 2018 for the Han River and Geum River basins were obtained from the Water Resources Information System (www.wamis.go.kr), hosted by the Ministry of the Environment, South Korea. Then, the monthly rainfall data in May and June were extracted as predictands of the ANN models. The monthly data of 45 climate indices provided by the National Oceanic and Atmospheric Administration (NOAA) were downloaded from the websites (www.esrl.noaa.gov and www.cpc.ncep.noaa.gov) from 1965 to 2018 and were used as predictors.
Correlation analysis was performed to determine the candidate predictors likely to influence the forecast target. Pearson correlation coefficients between the antecedent global climate indices and local rainfall with different lags from 1 to 12 months and the local rainfall in the target month were calculated every month. The lagged climate indices and past rainfall with relatively higher correlation coefficient (CORR) values were chosen as candidate predictors. Table 1 summarizes the candidate predictors with values of CORR and lag times for the Han River and Geum River basins. The results for the Han River basin showed that the highest correlation of −0.374 was detected for EAWR with a lag of 7 months for May and −0.450 for NAO with a lag of 6 months for June. For the Geum River basin, the rainfall in May had the highest correlation of 0.365 with 1 month lagged Geum, followed by the 5 months lagged WP with a correlation of 0.351. The highest correlation coefficient of 0.363 was obtained between the June rainfall and AMM with 12 months of lag. It can be seen from Table 1 that the correlations between the lagged climate indices and rainfall in the study area are not high but moderate. The optimal combination of several climate indices among the candidates is expected to provide a satisfactory prediction of the forecasting model. However, excellent predictive performance might not be expected unless additional strong predictors exist.
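The lagged correlation screening described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layouts (`index_monthly` keyed by `(year, month)`, `target_rain` keyed by year) and the function names are hypothetical and are not taken from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def lagged_corr(index_monthly, target_rain, target_month, lag):
    """Correlate target-month rainfall with a predictor observed `lag` months earlier.

    index_monthly: {(year, month): value}, hypothetical layout for a climate index
                   (or for past basin rainfall)
    target_rain:   {year: rainfall in target_month}
    """
    xs, ys = [], []
    for year, rain in sorted(target_rain.items()):
        # Step back `lag` months from the target month, crossing year boundaries.
        month, src_year = target_month - lag, year
        while month < 1:
            month += 12
            src_year -= 1
        if (src_year, month) in index_monthly:
            xs.append(index_monthly[(src_year, month)])
            ys.append(rain)
    return pearson(xs, ys)
```

Calling `lagged_corr` for every candidate index and every lag from 1 to 12 months, and keeping the pairs with the largest absolute coefficients, would give a candidate list of the kind summarized in Table 1.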
In this study, particular attention was focused on the concurrent relationship of rainfall amounts between the Han and Geum River basins. The synchronous Pearson correlation coefficients without lag time between monthly rainfall amounts of both basins were calculated to be 0.770 and 0.748 for May and June, respectively, which are much higher than the candidates in Table 1. Therefore, this study uses adjacent rainfall values in addition to past values of climate indices and target basin rainfall to construct rainfall forecasting models.

| Strategy for rainfall forecast
The present study constructed long-term rainfall forecast models using ANNs, widely used in various hydrological problems. Two types of ANN-based rainfall forecasting models were built for each basin; that is, a total of four forecasting models were constructed. The first type of ANN model (ANN1) has predictors of the lagged climate indices and past rainfall data of the target basin, while the second type (ANN2) has the lead rainfall of the nearby basin and past climate indices and rainfall as input variables. ANN2 uses an independent forecast value instead of the observed rainfall data of a nearby basin when forecasting future rainfall in the target basin. In practice, a pair of rainfall amounts for the Han River and Geum River basins was predicted using both ANN1 and ANN2. In the first stage of the preliminary forecast, each basin's rainfall was predicted from ANN1. In the second stage of the operational forecast, the basin's rainfall forecast from ANN1 was used as an input to ANN2 for the rainfall forecast of the other basin. The forecast strategy proposed in this study is illustrated in Figure 3.

FIGURE 2 Correlation coefficients of monthly precipitation between the Han River and Geum River basins
The ANN1 for the Han River basin has ANN1_MH and ANN1_JH; the former is for rainfall forecast in May, and the latter is for rainfall forecast in June. The ANN1 for the Geum River basin is split into ANN1_MG and ANN1_JG for May and June, respectively. Likewise, the ANN2 types include ANN2_MH, ANN2_JH, ANN2_MG, and ANN2_JG.
The procedure for model development and application is summarized as follows:
Step 1. Development of ANN1 type models for the target and adjacent basins
1. Preparing input-output patterns based on the historical data of lagged climate indices and lagged rainfall of the target basin for input nodes, and lead rainfall of the target basin for the output node.
2. Composing networks by varying the numbers of input and hidden nodes.
3. Obtaining ensemble predictions by repeating data splitting and initial parameter setting.
4. Determining the best network structure by evaluating the ensemble of predictions.
Step 2. Development of ANN2 type models for the target and adjacent basins
1. Preparing input-output patterns based on the historical data of lagged climate indices, lagged rainfall of the target basin, and lead rainfall of the nearby basin for input nodes, and lead rainfall of the target basin for the output node.
2. Obtaining ensemble predictions by repeating data splitting and initial parameter setting.
3. Determining the best network structure by evaluating the ensemble of predictions.
Step 3. Rainfall forecasting for the adjacent basin
1. Ensemble rainfall forecasting using the ANN1 type models developed in Step 1.
2. Calculating the ensemble mean that is used as an input of the ANN2 models in Step 4.
Step 4. Rainfall forecasting for the target basin
1. Ensemble rainfall forecasting using the ANN2 type models developed in Step 2.
2. Evaluating the quality of the ensemble forecasts.
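Steps 3 and 4 of the procedure can be illustrated with a small Python sketch. The trained ensemble members are represented here as plain callables, and the function name `two_stage_forecast` and the input layout (the adjacent-basin forecast appended as the last input) are assumptions made for illustration, not details given in the study.

```python
from statistics import mean

def two_stage_forecast(ann1_adjacent, ann2_target, lagged_adjacent, lagged_target):
    """Chain ANN1 (adjacent basin) into ANN2 (target basin).

    ann1_adjacent, ann2_target: lists of callables, one per ensemble member
    lagged_adjacent, lagged_target: lists of lagged predictor values
    """
    # Step 3: every ANN1 member forecasts the adjacent basin's rainfall,
    # and the ensemble mean is taken as the single forecast value.
    adjacent_members = [model(lagged_adjacent) for model in ann1_adjacent]
    adjacent_mean = mean(adjacent_members)

    # Step 4: the ensemble mean is appended to the target basin's lagged
    # predictors and passed to every ANN2 member.
    target_inputs = list(lagged_target) + [adjacent_mean]
    target_members = [model(target_inputs) for model in ann2_target]
    return adjacent_mean, target_members
```

The key design point is that ANN2 never sees observed concurrent rainfall at forecast time; it only receives the ANN1 forecast for the neighbouring basin.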

| Development of ANN model
The procedure of constructing a neural network involves preparing a training dataset, identifying input variables, optimizing the network structure and parameters, and evaluating the predictive performance. A total of 53 input/output patterns were divided into three parts: training dataset for calibration of the network weights, validation dataset for early stopping to avoid overfitting, and test dataset for evaluation of predictive performance. A specific training dataset affects the optimization of the network parameters, leading to a high output variance. To reduce the output variance, we applied the MCCV method, which repeatedly performs random sampling without replacement to compose each subdataset and cross-validation to evaluate the model performance. The entire dataset was randomly subsampled into three parts: 60% of the entire data for training, 20% of the data for validation, and the remaining 20% for the test dataset. The data splitting was repeated 100 times to build 100 individual ANN models that provided an ensemble of rainfall forecasts considering the variability in training datasets. The average of the ensemble members (outputs) was used as the final forecast.
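The repeated 60/20/20 subsampling of MCCV can be sketched as follows. This is a minimal illustration: the function name, the fixed seed, and the rounding of the split sizes (which gives 32/11/10 patterns for 53 samples) are assumptions, since the exact counts per subset are not stated in the text.

```python
import random

def mccv_splits(n_samples, n_repeats=100, frac_train=0.6, frac_val=0.2, seed=0):
    """Return `n_repeats` random (train, validation, test) index partitions.

    Each repeat shuffles all indices and cuts them without replacement,
    so the three subsets are disjoint and together cover the whole dataset.
    """
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    n_train = round(frac_train * n_samples)
    n_val = round(frac_val * n_samples)
    splits = []
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        train = idx[:n_train]
        val = idx[n_train:n_train + n_val]
        test = idx[n_train + n_val:]
        splits.append((train, val, test))
    return splits

def ensemble_mean(member_forecasts):
    """Average the forecasts of the individual ensemble members."""
    return sum(member_forecasts) / len(member_forecasts)
```

One network would be trained per partition, and `ensemble_mean` applied to the resulting member outputs gives the final forecast.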
FIGURE 3 Diagram of forecast strategy in this study
Identifying appropriate input variables is an important step in building a neural network. A set of input variables was determined by assessing each input variable's contribution to the output variable. To quantify the contribution of each input variable, the overall connection weights method proposed by Olden et al. (2004) was used. In Olden's method, the relative importance of an input variable i, RI_i, is calculated as follows: (a) for each hidden neuron j, the product of the connection weight (w_ij) between input neuron i and hidden neuron j and the connection weight (w_jk) between hidden neuron j and the output neuron k is calculated, and (b) the products (w_ij w_jk) are then summed over all hidden neurons, that is, $RI_i = \sum_{j} w_{ij} w_{jk}$. The larger the value of RI_i, the more strongly the input variable affects the output variable. Among the candidate predictors previously selected in Section 2.2, significant input variables were determined by removing variables with lower RI. Initially, all candidate input variables were used to build a preliminary neural network. Then, the network was re-trained after removing the one insignificant input variable with the lowest RI. This process was repeated until the prediction performance no longer improved significantly, which gave the optimal number of nodes in the input layer.
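As an illustration of the connection weights calculation, the following Python sketch computes RI for a network with one hidden layer and one output neuron. The weight layout (`w_input_hidden[i][j]`) and the choice to rank inputs by the absolute value of RI when picking the weakest one are assumptions for this example; the text does not state how the sign is handled in the ranking.

```python
def olden_importance(w_input_hidden, w_hidden_output):
    """Connection-weight importance RI_i = sum_j w_ij * w_jk.

    w_input_hidden:  list of rows, w_input_hidden[i][j] links input i to hidden j
    w_hidden_output: list, w_hidden_output[j] links hidden j to the single output
    """
    return [
        sum(w_ij * w_jk for w_ij, w_jk in zip(row, w_hidden_output))
        for row in w_input_hidden
    ]

def weakest_input(ri_values):
    """Index of the input with the smallest |RI| (assumed removal criterion)."""
    return min(range(len(ri_values)), key=lambda i: abs(ri_values[i]))
```

In a backward-elimination loop, the input returned by `weakest_input` would be dropped, the network re-trained, and the RI values recomputed.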
The network structure is highly dependent on the number of hidden layers and their nodes. In this study, the number of hidden layers was set to 1 for a simple network, and the optimal number of nodes in the hidden layer was determined by trial and error. The networks were trained by varying the number of hidden nodes from 1 to 10, and the optimal number was chosen when the network model achieved the best predictive performance. The initial assignment of the connection weights also affected the model performance. To reduce the variance of the output prediction resulting from the uncertainty of initial weights, a total of 100 sets of random initial weights were assigned to the network for the training phase. A backpropagation algorithm was used to calibrate the weights and biases.
A hyperbolic tangent sigmoid function and a linear function were used as the activation functions of the hidden layer and output layer, respectively. The data used in this study were normalized between −1.0 and 1.0 while considering the upper and lower limits of the hyperbolic tangent sigmoid function. The number of epochs was set to 5,000, within which a training process was terminated when the model error for the validation dataset reached a minimum.
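The scaling and the forward pass described above can be sketched as follows. The function names and weight layout are hypothetical, and min-max scaling is assumed as the way of mapping the data to the range −1.0 to 1.0; the text gives only the target range.

```python
import math

def scale_to_unit(x, x_min, x_max):
    """Map x linearly from [x_min, x_max] to [-1, 1] (assumed min-max scaling)."""
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0

def unscale(z, x_min, x_max):
    """Inverse of scale_to_unit: map z from [-1, 1] back to [x_min, x_max]."""
    return (z + 1.0) * (x_max - x_min) / 2.0 + x_min

def forward(x, w_ih, b_h, w_ho, b_o):
    """One-hidden-layer network: tanh hidden units, linear output unit.

    w_ih[i][j]: weight from input i to hidden j; b_h[j]: hidden biases
    w_ho[j]:    weight from hidden j to the output; b_o: output bias
    """
    hidden = [
        math.tanh(sum(x[i] * w_ih[i][j] for i in range(len(x))) + b_h[j])
        for j in range(len(b_h))
    ]
    return sum(h * w for h, w in zip(hidden, w_ho)) + b_o
```

A forecast in physical units would be obtained by passing the scaled inputs through `forward` and applying `unscale` to the output with the rainfall minimum and maximum.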

| Evaluation of model performance
The prediction accuracy of the ensemble forecasts was evaluated in terms of the normalized root mean squared error (nRMSE) and CORR:

$$\mathrm{nRMSE} = \frac{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Q_i - P_i\right)^2}}{Q_{\max} - Q_{\min}} \tag{1}$$

$$\mathrm{CORR} = \frac{\sum_{i=1}^{n}\left(Q_i - \bar{Q}\right)\left(P_i - \bar{P}\right)}{\sqrt{\sum_{i=1}^{n}\left(Q_i - \bar{Q}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(P_i - \bar{P}\right)^2}} \tag{2}$$

where n is the number of samples, Q_i is the observed value, P_i is the predicted value, Q_max is the observed maximum value, Q_min is the observed minimum value, \bar{Q} is the average of the observed values, and \bar{P} is the average of the predicted values. The hit rate and the Heidke skill score (HSS) were used to measure the forecast quality. The hit rate is the fraction of correct forecasts, calculated by dividing the number of correct forecasts by the total number of events. The hit rate ranges from 0 to 1, with a perfect forecast of 1 and the worst value of zero. The Heidke skill score measures the relative accuracy of the forecast over a random forecast. The HSS can be calculated using a contingency table representing possible combinations of observed and forecasted event pairs (Wilks, 2011). An example of a 3 × 3 contingency table for three categorical rainfall forecasts is presented in Table 2.
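The two accuracy measures can be computed directly from paired observed and predicted values. The sketch below assumes that nRMSE is the root mean squared error divided by the observed range, which is what the symbol definitions in this section suggest; the function names are illustrative.

```python
import math

def nrmse(obs, pred):
    """Root mean squared error normalized by the observed range (Q_max - Q_min)."""
    n = len(obs)
    rmse = math.sqrt(sum((q - p) ** 2 for q, p in zip(obs, pred)) / n)
    return rmse / (max(obs) - min(obs))

def corr(obs, pred):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(obs)
    q_bar, p_bar = sum(obs) / n, sum(pred) / n
    num = sum((q - q_bar) * (p - p_bar) for q, p in zip(obs, pred))
    den = math.sqrt(sum((q - q_bar) ** 2 for q in obs)) * math.sqrt(
        sum((p - p_bar) ** 2 for p in pred)
    )
    return num / den
```

Because nRMSE is divided by the observed range, values for basins or months with different rainfall magnitudes can be compared on the same scale.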
The formula of HSS is as follows:

$$\mathrm{HSS} = \frac{H - E}{T - E} \tag{3}$$

where H is the number of correct forecasts that agree with observations, E is the number of correct forecasts that would be expected by random chance, used as a reference measure of accuracy, and T is the total number of forecasts or observations. From the contingency table for the three categories shown in Table 2, the values of H, E, and T can be easily obtained. In the table, the element X_ik indicates the number of times the prediction was in the ith category and the observation was in the kth category. The HSS indicates the fractional improvement of the forecast relative to the random forecast and ranges from −∞ to 1. Negative scores mean that the model forecast is worse than the random forecast, and a zero score is equivalent to the chance forecast; both are regarded as no skill. A perfect forecast has a score of 1 (Wilks, 2011; Moreira et al., 2016).
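A short Python sketch shows how the hit rate and HSS follow from a contingency table. It assumes the standard multicategory definition in which the chance-expected number of correct forecasts is obtained from the row and column totals; the function names are illustrative, and any table passed in is the caller's data, not results from this study.

```python
def hit_rate(table):
    """Fraction of forecasts whose category matches the observed category."""
    total = sum(sum(row) for row in table)
    hits = sum(table[i][i] for i in range(len(table)))
    return hits / total

def heidke_skill_score(table):
    """HSS = (H - E) / (T - E) for a square contingency table.

    table[i][k]: number of cases forecast in category i and observed in category k
    """
    size = len(table)
    total = sum(sum(row) for row in table)               # T
    hits = sum(table[i][i] for i in range(size))         # H
    row_sums = [sum(table[i]) for i in range(size)]      # forecast totals
    col_sums = [sum(table[i][k] for i in range(size)) for k in range(size)]
    # E: correct forecasts expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total
    return (hits - expected) / (total - expected)
```

A table with all counts on the diagonal gives a score of 1, while a table whose diagonal matches what the marginal totals alone would produce gives a score of 0.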
The skill of probabilistic forecasts for the predicted ensembles can be assessed using the Brier skill score. The Brier skill score (BSS) for multicategory forecasts is calculated as

$$\mathrm{BSS} = 1 - \frac{\mathrm{BS}}{\mathrm{BS}_{\mathrm{ref}}} \tag{4}$$

where BS is the Brier score calculated from Equation (5), and BS_ref is the Brier score of the reference or baseline predictions that we seek to improve on:

$$\mathrm{BS} = \frac{1}{N}\sum_{t=1}^{N}\sum_{i=1}^{R}\left(f_{ti} - o_{ti}\right)^2 \tag{5}$$

where R is the number of classes, N is the number of forecasting instances, f_ti is the predicted probability of class i at instance t, and o_ti is 1 if the observation at instance t falls in class i and 0 otherwise. A skill score larger than zero means that the performance is better than that of the baseline or reference predictions.
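The Brier skill score can be sketched in Python as below. Two points are assumptions for illustration only: class probabilities are taken as the fraction of ensemble members falling in each of three categories, and the category thresholds `lower` and `upper` are hypothetical parameters. The choice of reference forecast is left to the caller.

```python
def category_probs(members, lower, upper):
    """Fraction of ensemble members below, within, and above two thresholds."""
    n = len(members)
    below = sum(1 for m in members if m < lower)
    above = sum(1 for m in members if m > upper)
    return [below / n, (n - below - above) / n, above / n]

def brier_score(probs, outcomes):
    """Multicategory Brier score.

    probs[t][i]: forecast probability of class i at instance t
    outcomes[t]: index of the class that was actually observed at instance t
    """
    total = 0.0
    for f_t, observed in zip(probs, outcomes):
        for i, f in enumerate(f_t):
            o = 1.0 if i == observed else 0.0
            total += (f - o) ** 2
    return total / len(probs)

def brier_skill_score(probs, outcomes, ref_probs):
    """BSS = 1 - BS / BS_ref, where ref_probs holds the reference forecasts."""
    return 1.0 - brier_score(probs, outcomes) / brier_score(ref_probs, outcomes)
```

Using equal class probabilities as `ref_probs` is one common baseline; a forecast identical to the reference then scores exactly zero.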

| Contribution of input variables
To identify the optimal predictors of the ANN models, the contribution of each candidate input variable to the output variable was evaluated by computing the relative importance (RI) using Olden's connection weight method. For the Han River basin, the climate index of EAWR (7) was found to be the most influential input to rainfall in May, while EAWR (3) was the weakest. The numbers in parentheses indicate lagged months. Then, the network was re-trained without the weakest index, EAWR (3), and the values of RI were recalculated to find the second weakest index, QBO (8). This process was repeated to find the next least significant index, WP (12), after which POL (12), AO (10), POL (7), NOI (12), and SPI (11) were removed in that order. The contribution of each variable of the ANN1 type models is summarized in Table 3. The numbers in the table indicate the order in which the input variables were removed. Using a similar procedure, the order of input variables to be removed was determined for the ANN2 type models, and the results are presented in Table 4. Figures 4-7 show box plots of the RI values of each input variable of the finally chosen ANN1 and ANN2 models, which have the best model performance with the optimal number of input nodes. The spread of the RI values is attributed to the random initialization of the weights and biases, as well as to the random data subsampling for the training, validation, and test datasets. It is evident that the trained network weights depend significantly on the training dataset and hence on the data subsampling.
For the monthly rainfall in May in the Han River basin, it can be seen from Figure 4 that the climate index of EAWR (7) is the most influential input to the ANN1 model, while Geum (0), the lead rainfall in the nearby basin, is the most significant input to the ANN2 model, followed by Han (5), NOI (12), and EAWR (7). It is noticeable that the relative contributions of the finally selected input variables in the ANN1 and ANN2 models are somewhat different. As shown in Figure 5, for the ANN1 model the monthly rainfall in June in the Han River basin has the largest positive contribution from Han (4) and the largest negative contribution from SCAND (10). As in the May rainfall forecast, Geum (0) makes the largest contribution to the June rainfall forecast using ANN2. Figures 6 and 7 show the relative importance of each input variable of the ANN1 and ANN2 models for the rainfall forecasts in the Geum River basin. For forecasting rainfall in May, as shown in Figure 6, the ANN1 model has five significant positive and four significant negative input variables. From Figures 4-7, it was apparent that the concurrent rainfall in the nearby basin had the greatest effect on the outputs of the forecasting models.

| Performance of ANN models using lagged indices and past rainfall
The number of input neurons was optimized by the relative importance of the variables, and the number of hidden neurons was determined by trial and error. Table 5 summarizes the statistical performances of the ANN models with inputs of past climate indices in terms of the nRMSE and CORR. The best network structure was identified based on the minimum nRMSE and maximum CORR. To forecast the monthly rainfall of the Han River basin, the network configurations of ANNs with 11-4-1 (number of input nodes, number of hidden nodes, number of output nodes) and 9-4-1 achieved the best performance for May and June, respectively. For the Geum River basin, the optimal ANN models with 9-7-1 and 7-2-1 were finally chosen for forecasting May and June rainfall, respectively. The best ANN model for forecasting May rainfall in the Han River basin had values of nRMSE of 0.113, 0.139, and 0.163, and values of CORR of 0.809, 0.725, and 0.641 for the training, validation, and testing datasets, respectively. For June, the nRMSE values were 0.118, 0.138, and 0.187, and the CORR values were 0.853, 0.771, and 0.683 for the training, validation, and testing datasets. For nRMSE, a value lower than 10% is considered excellent model performance, a value between 10% and 20% is good, a value between 20% and 30% is fair, and a value higher than 30% is poor (Dettori et al., 2011; Nouri and Homaee, 2018). Therefore, the developed models could provide good rainfall forecasts with nRMSE <0.2 (20%) and CORR >0.6.
FIGURE 4 Relative importance of the input variables of (a) ANN1 and (b) ANN2 obtained using the connection weight method for the rainfall in May for the Han River basin
The final ANN models for the Geum River basin had values of nRMSE of 0.132, 0.168, and 0.192 with CORR of 0.844, 0.749, and 0.670 for the training, validation, and testing datasets, respectively, for the May rainfall predictions. For June, the nRMSE values were 0.149, 0.167, and 0.185, and the CORR values were 0.817, 0.756, and 0.667 for the training, validation, and testing datasets. The results also showed an acceptable performance.
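The two performance measures and the nRMSE rating bands used above can be sketched as follows. This section does not state how the RMSE is normalised, so dividing by the range of the observations is an assumption made here for illustration; the function names are likewise hypothetical.

```python
import math


def rmse(obs, sim):
    """Root mean squared error between observed and simulated values."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def nrmse(obs, sim):
    """RMSE normalised by the observed range.
    ASSUMPTION: the normalising quantity is not given in this section."""
    return rmse(obs, sim) / (max(obs) - min(obs))


def corr(obs, sim):
    """Pearson correlation coefficient."""
    n = len(obs)
    mo, ms = sum(obs) / n, sum(sim) / n
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)


def rate_nrmse(value):
    """Rating bands: <10% excellent, 10-20% good, 20-30% fair, >30% poor.
    Values exactly on a boundary are assigned to the upper band here."""
    if value < 0.10:
        return "excellent"
    if value < 0.20:
        return "good"
    if value < 0.30:
        return "fair"
    return "poor"


# Test-set nRMSE values reported above all fall in the "good" band.
ratings = [rate_nrmse(v) for v in (0.163, 0.187, 0.192, 0.185)]
print(ratings)  # ['good', 'good', 'good', 'good']
```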

| Performance of ANN models using lagged climate indices and lead adjacent rainfall
The resultant ANN models using future rainfall information from a nearby basin have architectures of 10-3-1 for the prediction of rainfall in May and 7-8-1 for rainfall in June for the Han River basin. The model structure with 9-3-1 has the best performance both for rainfall prediction in May and June for the Geum River basin. Table 6 shows the performance measures of ANN2 considering lagged climate indices, past target basin rainfall, and future rainfall in nearby basins. Compared with the results in Table 5, the additional use of the rainfall data of the nearby basin at the concurrent time of the target month has a great impact on the model performance for training, validation, and testing datasets. The identified optimal ANN models for both basins had nRMSE values of 0.078-0.113, 0.111-0.132, and 0.135-0.157 and CORR values of 0.885-0.920, 0.850-0.876, and 0.797-0.827 for the training, validation, and testing phases, respectively, which outperformed the corresponding models that did not use future rainfall information for the nearby basin. The prediction errors in terms of nRMSE in Table 6 were reduced by approximately 28.9, 21.6, and 21.3% compared to the corresponding results in Table 5 for the training, validation, and testing datasets, respectively. This implies that future rainfall information of the nearby basin at the target month may potentially increase the prediction performance for the studied sites.
FIGURE 5 Relative importance of the input variables of (a) ANN1 and (b) ANN2 obtained using the connection weight method for the rainfall in June for the Han River basin

| Comparison of model performance with/without forecast rainfall of an adjacent basin
To investigate the effect of the adjacent basin rainfall forecasts on the ANN model, Table 7 compares the performance of the ANN models with and without near-basin rainfall forecasts for both basins. The model performances were compared in terms of RMSE, nRMSE, and CORR between observed and forecasted rainfall for the test datasets. Table 7 shows that the model performance is better when using nearby basin rainfall forecasts of the target months. Smaller RMSE and nRMSE and much higher CORR were obtained when considering lagged climate indices and forecasted nearby basin rainfall. This is because monthly rainfall amounts in May and June are highly correlated between the target and nearby basins. The more accurate the rainfall prediction for the nearby basin, the better the rainfall forecast for the target basin is expected to be. Conversely, poorly predicted rainfall in a nearby basin can provide noisy information to the forecasting model. Owing to the satisfactory rainfall predictions of the nearby basins, which were based only on climate indices and past rainfall data, the performance of the ANN2-type models was considerably improved for both basins. For the Han River basin, the RMSE decreases from 39.5 to 31.8 mm, CORR increases ...
FIGURE 6 Relative importance of the input variables of (a) ANN1 and (b) ANN2 obtained using the connection weight method for the rainfall in May for the Geum River basin
TABLE 5 Summary of the model performances considering past climate indices information

| Evaluation of model skilfulness for rainfall forecasts
The ANN modelling was performed using the approach suggested in section 2.3, and the results are presented in Figures 8 and 9. The figures compare the observed monthly rainfall records in May and June with the forecasted values and the 95% uncertainty bands estimated from the forecast ensembles. In the figures, the solid line represents the observed rainfall data, the dotted line represents the ensemble mean of the rainfall forecasts, and the shaded area represents the 95% uncertainty band. Overall, the predicted rainfall matches the observations reasonably well, even though some observed points deviate considerably from the upper and lower limits of the uncertainty band. The modelling resulted in satisfactory predictive performance, with the 95% uncertainty band bracketing many observation points. The ANN models for the Han River basin produced ensembles in which 62.3 and 49.1% of the observed data were within the 95% uncertainty bands of the May and June rainfall forecasts, respectively. In the case of the Geum River basin, the uncertainty band of the May rainfall forecasts contains 49.1% of all observation points, while the band for June contains 50.9% of the observed data, so the forecasting ability is satisfactory.
Categorical rainfall forecasts on an ordinal scale can help investigate the correspondence between observed and forecasted rainfall. In this study, the observed rainfall data were divided into three categories: below-normal, normal, and above-normal conditions, in which the lower and the upper limits of the normal category are defined as 33.3 and 66.7% in ascending rank order of rainfall magnitude, respectively. Rainfall that lies below (above) the lower (upper) limit is defined as below (above)-normal rainfall.
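The three-category classification described above can be sketched as follows. The exact quantile convention used for the 33.3 and 66.7% limits is not stated, so the default method of Python's `statistics.quantiles` is assumed here; the function names and the rainfall values are made up for illustration.

```python
import statistics


def tercile_limits(observed):
    """Lower and upper limits of the 'normal' class (33.3% and 66.7% points).
    ASSUMPTION: uses the default 'exclusive' quantile method."""
    lower, upper = statistics.quantiles(observed, n=3)
    return lower, upper


def categorize(value, lower, upper):
    """Assign a rainfall amount to the below / normal / above class."""
    if value < lower:
        return "below"
    if value > upper:
        return "above"
    return "normal"


# Hypothetical monthly rainfall totals in mm (not data from the study).
observed = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0]
lower, upper = tercile_limits(observed)
classes = [categorize(v, lower, upper) for v in observed]
print(classes.count("below"), classes.count("normal"), classes.count("above"))
```

Forecast values would be classified against the same limits derived from the observations, so that both series share one set of category boundaries.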
Contingency tables were prepared to investigate the categorical rainfall forecasts. Table 8 compares the contingency tables with and without the forecast rainfall of the nearby basin for May and June rainfall forecasting for the Han River basin. As shown in the table, considering lagged climate indices together with the forecast rainfall of the nearby basin improves the hit rate, that is, the proportion of occasions on which the below, normal, and above conditions were correctly forecasted, in comparison with the case of applying only lagged climate indices. For May, the hit rate of ANN2 was 52.8%, while it was 45.3% for ANN1. The hit rates of the June rainfall forecasts were 50.9% for ANN1 and 64.2% for ANN2, showing better performance when ANN2 was used for rainfall forecasting. In particular, the number of correct forecasts of ANN2 for the below-normal rainfall class was higher than that of ANN1 for both May and June, which implies that ANN2 is useful for predicting drought. From Table 8 and Equation (3), the Heidke skill scores (HSS) were calculated for the May and June rainfall forecasts for the Han River basin. The HSS of the rainfall forecasts from the ANN1 model is 0.18 for May. This indicates that the forecast was improved by 18% over the reference forecast expected by chance. The HSS for rainfall forecasts in May using the ANN2 model was 0.29. This means that the forecast quality using ANN2 increased by 29% over the reference forecast, which is better than the result of ANN1. For June, ANN1 has an HSS value of 0.26, while ANN2 provides an HSS of 0.46, a forecasting improvement of 46% over the random forecast.
The addition of forecast rainfall of the nearby basin increases the June HSS by 0.20 relative to the case of using only lagged climate indices and past rainfall as the predictors. Table 9 presents the contingency table for the Geum River basin. The hit rates of ANN1 and ANN2 were calculated to be 60.4 and 66.0%, respectively, for the May rainfall. Similarly, the hit rates for June were 52.8 and 62.3% when using ANN1 and ANN2, respectively. The number of correct forecasts of ANN2 was higher than that of ANN1 for all three classes. The increase in hit rates is attributed to the forecast rainfall in the nearby basin. The HSS values for the Geum River basin rainfall forecasts were calculated using Table 9 and Equation (3). For May, the HSS is 0.49 for ANN2 compared to 0.40 for ANN1, which shows a noticeable improvement in forecast quality. For June, ANN1 shows some forecast skill with an HSS of 0.29, but the skill is improved using ANN2, with an HSS of 0.43. ANN2 thus improved the HSS by 0.14 relative to ANN1. From the comparison of ANN1 and ANN2 in terms of HR and HSS, it can be concluded that the use of ... In addition, the evaluation of the skill of probabilistic rainfall forecasting showed that the ANN2 model, which additionally used the forecast rainfall of the nearby basin as a predictor, provided better forecast quality than the ANN1 model.
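The hit rate and Heidke skill score can be computed from a 3 x 3 contingency table as sketched below. Equation (3) is not reproduced in this section, so the standard multi-category form HSS = (PC - E) / (1 - E) is assumed, where PC is the proportion correct and E is the proportion expected correct by chance from the row and column totals. The counts are made up for illustration and are not taken from Table 8 or Table 9.

```python
def hit_rate(table):
    """Proportion of forecasts falling on the diagonal (correct category)."""
    total = sum(sum(row) for row in table)
    hits = sum(table[k][k] for k in range(len(table)))
    return hits / total


def heidke_skill_score(table):
    """Multi-category HSS = (PC - E) / (1 - E).
    ASSUMPTION: standard form; Equation (3) is not shown in this section."""
    k = len(table)
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[r][c] for r in range(k)) for c in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (hit_rate(table) - expected) / (1.0 - expected)


# Hypothetical counts: rows = forecast class, columns = observed class,
# in the order below-normal, normal, above-normal.
table = [
    [6, 3, 1],
    [2, 5, 3],
    [2, 2, 6],
]
hr = hit_rate(table)
hss = heidke_skill_score(table)
print(round(hr, 3), round(hss, 2))  # 0.567 0.35
```

When the three classes are close to equally populated, E is close to 1/3, so HSS is approximately (HR - 1/3) / (2/3); for example, a hit rate of 52.8% gives roughly 0.29, in line with the values reported above.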

| DISCUSSION
The predictive performance of data-driven models is highly dependent on the selection of appropriate input variables. There can be numerous combinations of large-scale global climate indices used as potential predictors in rainfall forecasting models. Selecting optimal input variables for data-driven rainfall forecasting models by trial and error is time-consuming (Abdourahamane et al., 2019). Therefore, in this study, we used the overall connection weights method (Olden et al., 2004) to reduce the effort of input selection. This method quantifies the relative importance of each input variable to the prediction and eliminates unimportant variables. It is effective when there is a lack of understanding of the physical relationship between the potential predictors (climate indices) and the predictand (rainfall). Based on the input selection method, we identified 7-11 predictors that produced the best predictive performance for the study basins. As shown in Figures 4-7, the identified predictors for the Han River basin were found to be significantly different from those for the Geum River basin. This was mainly due to the different candidate input variables chosen in the initial guessing step presented in Table 1. This also means that it is difficult to find distinct common climate indices that significantly affect May and June rainfall in the study basins, unlike Australia where three dominant climate indices, SOI, ENSO, and DOI, are known to influence rainfall variability. Therefore, when distinct climate indices with significant correlations are hard to find, good prediction performance requires many input variables whose correlation with rainfall is weaker than desired, and the increased number of input variables raises the risk of overfitting.
This problem can be solved to some extent by adding only one input variable, the forecast rainfall of an adjacent basin with high correlation, as revealed in this study. Ideally, more influential predictors that affect rainfall variability would be found to improve the forecasting ability. Abbot and Marohasy (2015) demonstrated that the use of both lagged and forecast climate indices with ANN models improved the quality of rainfall predictions compared to using lagged climate indices alone. However, in our study, the concurrent correlation between the climate indices and rainfall at the time of the forecast was not strong in the study area, so it was not appropriate to apply this method to improve the quality of the forecast. Even if there are several major climate indices that can be used as additional predictors, a lot of effort is required to build a predictive model for each climate index, and as the number of input variables increases, much effort is needed to prevent overfitting. To overcome this limitation, instead of using many predicted climate indices, we used only one additional predictor, the forecast rainfall in the adjacent basin. This approach was applicable due to the high concurrent correlation of rainfall data between the two study basins. The Pearson correlation coefficients between the rainfall amounts in the two basins greatly exceed 0.7 in both May and June, which is much higher than the correlation coefficients between the target rainfall and lagged predictors presented in Table 1. In addition, in the evaluation of the relative importance of the input variables, as shown in Figures 4-7, the lead rainfall in the adjacent basin was found to be the factor with the greatest influence on the output of the ANN model.
In this study, the improvement of rainfall forecasting ability was confirmed by using the forecast rainfall of the adjacent basin. As shown in Table 3, the proposed approach produced superior performance in rainfall forecasting in the study basins, reducing the RMSE by about 16% compared to the conventional approach using lagged climate indices and historical rainfall. Compared with previous results by Lee et al. (2018), our approach performs better in forecasting the May to June rainfall in the Geum River basin. The sum of the RMSE values of the May and June rainfall in the Geum River basin in Table 7 was 78.5 mm, whereas the previous study showed that the RMSE for the test data set was 86.84 mm.
Uncertainties in artificial neural networks can arise from differences in training datasets, training algorithms, network structures, and parameters. This study focused on the uncertainty arising from various training datasets. Some 60% of the collected input/output patterns were used as a training dataset. The data length may not be sufficient to achieve the generalizability of the trained model, which can induce a greater risk of overfitting if only one training dataset is used. A total of 100 training datasets were constructed to reduce uncertainty due to training dataset selection. This was achieved by using the MCCV technique, which is an iterative random subsampling method, to partition the entire data into three parts: training, validation, and test datasets. These 100 random partitions generated different optimal parameters of the networks, variations in RI values (Figures 4-7), and prediction intervals (Figures 8 and 9). In general, predictive performance is good when the uncertainty bandwidth is narrow and the band contains many observations. Abbaspour (2008) defined the r-factor to measure the intensity of uncertainty, calculated by dividing the mean width of the 95% uncertainty band by the standard deviation of the observed data. Singh et al. (2014) suggested that an r-factor less than 1 is desirable. In Figure 8, it can be seen that the uncertainty band for the May rainfall in the Han River basin includes more observation points than the previous study results (Lee et al., 2020). In the case of June, the proportion lying within the uncertainty band in this study slightly decreased because the average width of the band was narrower than in the previous study. In this study, the r-factor was calculated to be less than 1.0 for both May and June, which is slightly smaller than the previous results by Lee et al. (2020).
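The repeated random partitioning and the two band diagnostics described above can be sketched as follows. Only the roughly 60% training share and the 100 repetitions come from the text; the 20% validation share, the sample size of 53 used in the example, the use of the sample standard deviation, and all function names are assumptions made for illustration.

```python
import random
import statistics


def mccv_splits(n_samples, n_repeats=100, train_frac=0.6, val_frac=0.2, seed=0):
    """Monte Carlo cross-validation: repeated random train/validation/test
    partitions of the sample indices.
    ASSUMPTION: val_frac is not given in this section."""
    rng = random.Random(seed)
    n_train = round(train_frac * n_samples)
    n_val = round(val_frac * n_samples)
    splits = []
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        splits.append(
            (idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:])
        )
    return splits


def band_summary(observed, lower, upper):
    """Fraction of observations inside the band, and the r-factor
    (mean band width divided by the standard deviation of the observations).
    ASSUMPTION: the sample standard deviation is used."""
    inside = sum(lo <= o <= up for o, lo, up in zip(observed, lower, upper))
    mean_width = statistics.mean([up - lo for lo, up in zip(lower, upper)])
    return inside / len(observed), mean_width / statistics.stdev(observed)


splits = mccv_splits(53)
train, val, test = splits[0]
print(len(splits), len(train), len(val), len(test))  # 100 32 11 10

# Made-up observations and band limits, for illustration only.
coverage, r_factor = band_summary(
    [1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 5.0], [2.0, 4.0, 6.0, 7.0]
)
```

In practice the lower and upper limits would be the 2.5th and 97.5th percentiles of the ensemble of forecasts obtained from the models trained on each partition.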
The modelling results indicate that the prediction accuracy of ANN models could be significantly improved with an additional predictor of the forecast rainfall in the adjacent basin. A pair of rainfall forecasts for nearby basins can complement each other to improve the model performance. The forecasting approach proposed in this study can be extended to other regions where a concurrent relationship of rainfall data between nearby basins is evident.

| CONCLUSION
The main issue regarding long-term rainfall forecasting with artificial neural networks is the selection of appropriate input variables that have a significant impact on forecast skills. Generally, monthly and seasonal rainfall forecast ANN models have used input variables of several lagged global climate indices and antecedent rainfall data of a target area. This study proposed forecasting monthly rainfall on a basin-scale using lagged global climate indices and past rainfall of a targeted basin and future rainfall in adjacent basins to improve the model prediction. In the model development stage, the observed data concurrent with the prediction target month were used as the future rainfall data of the adjacent basin. Then, in the forecasting stage in practice, the observed rainfall of the adjacent basin is replaced with the forecasted value obtained from the independently constructed ANN model with inputs of lagged climate indices and past rainfall data.
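The two-stage use of the adjacent basin's rainfall described above can be sketched as follows. The two models are represented by plain callables standing in for trained networks; the function name, the stand-in models, and the input values are all hypothetical.

```python
def forecast_target_rainfall(ann2_target, ann1_adjacent,
                             target_lagged, adjacent_lagged):
    """Forecast stage: the adjacent basin's rainfall for the target month is
    first forecast from its own lagged inputs, then appended to the target
    basin's lagged inputs in place of the observed concurrent rainfall that
    was used during model development."""
    adjacent_forecast = ann1_adjacent(adjacent_lagged)
    return ann2_target(target_lagged + [adjacent_forecast])


# Stand-in models (NOT trained networks): simple arithmetic for illustration.
def ann1_adjacent(inputs):
    return sum(inputs) / len(inputs)


def ann2_target(inputs):
    # The last input is the adjacent-basin rainfall for the target month.
    return 0.5 * inputs[0] + 0.5 * inputs[-1]


forecast = forecast_target_rainfall(
    ann2_target, ann1_adjacent,
    target_lagged=[100.0, 0.2],
    adjacent_lagged=[80.0, 120.0],
)
print(forecast)  # 100.0
```

Because the target model is trained with observed adjacent rainfall but run with a forecast, any error in the first-stage forecast propagates into the second stage.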
The proposed approach was applied to forecast the basin-scale rainfall in May and June for the Han River basin and the Geum River basin, South Korea, where a strong concurrent correlation of rainfall exists. To build the ANN-based rainfall forecast models, the predictors were selected by evaluating the relative importance of input variables. The Monte Carlo cross-validation technique was used to determine the best network structures and to reduce the output variance due to diverse data subsampling of training, validation, and test datasets. The preliminary ANN models, which do not use future rainfall information from a nearby basin and whose forecasts for one basin were later used as an additional input to the other basin's forecasting model, have architectures of 11-4-1 and 9-4-1 for the prediction of rainfall in May and June for the Han River basin, respectively. For the Geum River basin, the preliminary ANN models have structures of 9-7-1 and 7-2-1, respectively. The final ANN models with an additional input of the rainfall forecast of the nearby basin have structures of 10-3-1 for rainfall in May and 7-8-1 for rainfall in June for the Han River basin, and 9-3-1 for May and June for the Geum River basin. The forecasting results revealed that the final ANN models based on the proposed approach provided superior predictive performance relative to the preliminary ANN models in terms of the normalized root mean squared errors and correlation coefficients. The preliminary models showed nRMSE values from 0.163 to 0.192 and CORR from 0.641 to 0.683, while the final models showed an nRMSE of 0.131-0.162 and CORR of 0.765-0.783 for the test datasets of both study basins. For the categorical forecasts, the final ANN models produced skilful rainfall forecasts with hit rates and Heidke skill scores ranging from 50.9 to 66.0% and 0.29 to 0.49, respectively.
It can be concluded that the use of rainfall forecasts of a nearby basin as a predictor enhanced the ANN model's ability to forecast rainfall in the study area for May and June.