East Asia Atmospheric River Forecast With a Deep Learning Method: GAN‐UNet

Accurate forecasting of atmospheric rivers (ARs) is important for preventing losses from extreme precipitation. However, traditional numerical weather prediction (NWP) models are computationally expensive and can be limited in accuracy by inaccurate physical parameter settings. To overcome these limitations, we propose a deep learning (DL) model, called GAN‐UNet, to forecast AR occurrence, position, and intensity in East Asia. GAN‐UNet captures the complex nonlinear relationship between the inputs at a past time, namely the vertically integrated water vapor transport (IVT), zonal wind at 850 hPa (U850), and meridional wind at 850 hPa (V850), and the forecast output (IVT, U850, or V850), and its results are comparable to those of NWP models. In addition, the average model (AM), which integrates the results generated by GAN‐UNet and the European Centre for Medium‐Range Weather Forecasts (ECMWF), outperforms all the NWP models selected in this study, demonstrating the potential of the DL method to improve the performance of NWP. Specifically, the 5‐day average F1 scores of the AM in the two key regions of East Asia are 0.777 and 0.845, significantly better than those obtained by ECMWF (0.712 and 0.794); the 5‐day average intersection over union values of the AM are 0.706 and 0.688, while those of ECMWF are 0.675 and 0.64; in terms of intensity forecast, GAN‐UNet and AM exhibit lower differences in most of the intensity bins, except for the final bin with IVT greater than 825 kg m−1 s−1. This analysis shows that GAN‐UNet is an effective model for forecasting ARs.

Given the important impact of ARs on extreme precipitation (EP), it is valuable to accurately forecast their occurrence and intensity. Among all the forecast methods, numerical weather prediction (NWP) may currently be the most accurate and popular. Furthermore, the accuracy of NWP models has been steadily improving (Alley et al., 2019). However, NWP still has some obvious shortcomings. One of the biggest is that its accuracy is often limited by an insufficient understanding of the interaction of physical processes in the model and by systematic bias (Bauer et al., 2015; Geer, 2021). With the rise of deep learning (DL), data-driven DL models have been developed to predict future weather, and they have shown promising results in predicting different sounding meteorological variables (i.e., temperature, specific humidity, geopotential height, and horizontal wind speed) and surface meteorological variables (i.e., 2 m temperature, 10 m wind components, and sea surface pressure) (Bi et al., 2022; Lam et al., 2022; Pathak et al., 2022). Specifically, FourCastNet (Pathak et al., 2022) was the first DL model to demonstrate forecast results comparable to those of the European Centre for Medium-Range Weather Forecasts (ECMWF; Molteni et al., 1996), the existing state-of-the-art NWP forecast archive in the THORPEX (The Observing System Research and Predictability Experiment) Interactive Grand Global Ensemble (TIGGE; Park et al., 2008), but its forecasts did not surpass ECMWF. Pangu weather (Bi et al., 2022), on the other hand, achieved the distinction of surpassing ECMWF. Subsequently, GraphCast (Lam et al., 2022) from Google DeepMind surpassed Pangu weather. These DL models exhibit excellent overall forecasting performance globally, especially Pangu weather and GraphCast, which outperformed ECMWF in most variables. However, the aforementioned models did not specifically focus on AR events, with only FourCastNet providing a rough evaluation of global AR forecasting. As far as we know, there is currently no research that uses DL models to forecast AR events directly.
Nevertheless, DL has some applications focused on ARs, including AR detection (Higgins et al., 2023; Prabhat et al., 2021; Tian et al., 2023) and postprocessing of AR forecasts (Chapman et al., 2019, 2022). Tian et al. (2023) employed an ensemble of 20 different DL models to perform semantic segmentation for ARs. Each model in the ensemble was trained independently and the final result was obtained via majority voting. When tested across the whole test dataset, the ensemble obtained an IOU for ARs of 40.5% and performed much better than any single model in the ensemble, none of which surpassed an IOU of 38.5%. Chapman et al. (2019) developed a convolutional neural network model as a postprocessing tool for the vertically integrated water vapor transport (IVT) field from NWP. The results indicated that convolutional neural network postprocessing reduced the root-mean-square error by 9%-17% while increasing the correlation between observations and predictions by 0.5%-12%. Furthermore, Chapman et al. (2022) developed a variety of DL-based probabilistic predictions, which showed that DL postprocessing methods are computationally cost-effective, easy to implement, and can compete with or outperform the dynamical ensemble's raw model output from the National Centers for Environmental Prediction Global Ensemble Forecast System. Despite this emerging understanding of AR forecasts based on or partly based on DL, whether ARs can be forecasted using a DL model alone, with results comparable to NWP, remains unknown.
In light of the aforementioned research gaps and limitations, we develop a DL model called GAN-UNet, which is based on Generative Adversarial Networks (GANs; Goodfellow et al., 2020), to forecast the spatio-temporal ARs over East Asia, evaluate GAN-UNet performance using ERA5 reanalysis data as benchmarks, and compare it with the NWP models. Specifically, we use stacked components in GAN-UNet to describe the details of the forecasts more accurately and implement a hierarchical temporal aggregation strategy to reduce the number of iterations and thus limit error accumulation. Overall, the purpose of this study is three-fold: first, to design a DL forecast model for efficient and accurate AR forecast; second, to verify the feasibility of the DL model by comparing the results of GAN-UNet with those obtained from multiple NWP models; finally, to integrate GAN-UNet with the state-of-the-art NWP model to explore whether the DL model can improve the results of the NWP model.
The remainder of this paper is organized as follows: Section 2 describes data and methods. Section 3 first illustrates the importance of ARs to EP and then demonstrates the forecasting effectiveness of GAN-UNet for AR events compared to NWPs. Forecasting effectiveness is measured by the accuracy of the forecasted AR event occurrence in the key regions and by the differences in AR event profile and intensity between the forecasts and the labels. Conclusions and discussion are provided in Section 4.

Data
The data used in this study as input variables and labels for GAN-UNet are the fifth generation of ECMWF atmospheric reanalysis data (ERA5; Hersbach et al., 2020) from 1959 to 2022, including specific humidity (q), zonal wind (u), meridional wind (v), and precipitation for the warm season from May to August. The time resolution of q, u, and v is 6 hr (0000/0600/1200/1800 UTC), and precipitation is at hourly resolution.
In 2005, a World Weather Research Program called TIGGE was launched at a workshop at ECMWF, for which one of the main objectives was to strengthen cooperation between operational centres and universities in the development of ensemble forecasting (Swinbank et al., 2016). The TIGGE archive includes forecast data from 10 different operational weather centers, including the ECMWF, the Japan Meteorological Agency (JMA), the Korean Meteorological Administration (KMA), and the US National Centers for Environmental Prediction (NCEP), among others. To compare the forecast performance with GAN-UNet, control forecast data provided by TIGGE are used. The products that have been widely confirmed to exhibit excellent forecast performance with lead times of up to 15 days are selected (Bi et al., 2022; Lam et al., 2022; Nardi et al., 2018; Wick et al., 2013). Specifically, the products are selected from (a) the Environment and Climate Change Canada (ECCC), (b) ECMWF, and (c) NCEP. ECCC and ECMWF only provide the forecasts for 0000 and 1200 UTC. All the models provide forecasts up to a lead time of 360 hr with 6-hr steps. To achieve optimal time consistency with the available NWP data, the comparison study uses 0000 and 1200 UTC as starting times, 6-hr steps, and a lead time of 360 hr for both the DL and NWP models. All the variables mentioned above have a spatial resolution of 1.5° × 1.5°.

Definition of Different Precipitation Levels, AR-Related Precipitation, and AR Events
In East Asia, rich water vapor from the low latitudes is transported during the summer monsoon season, which brings notable precipitation to eastern China, the Korean Peninsula, and Japan. We focus on two key study regions: East China (EC, 27°-34.5°N, 114°-123°E) and Korea and Japan (KJ, 31.5°-39°N, 126°-135°E), land areas where ARs occur frequently. In this study, we divide all precipitation days into categories based on precipitation intensity (Table 1). The AR frequency is the number of AR occurrences divided by the total number of time steps. The AR-related precipitation is defined as the precipitation covered by the AR grid (Prabhat et al., 2021; Tian et al., 2023; Zhang et al., 2023). When an AR occurs and occupies more than 50% of the area of a key region, we define it as an AR event. The intensity of an AR event is defined as the average IVT intensity of the AR event.
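The AR event definition above can be illustrated with a short sketch. It is not the authors' code: the function name `ar_event`, the unweighted cell count (no latitude-dependent area weighting), and the choice to average IVT over the AR cells inside the region are all assumptions made for illustration.

```python
def ar_event(ar_mask, ivt, min_fraction=0.5):
    """Classify one key region at one time step.

    ar_mask: 2D list of bools, True where a grid cell belongs to a detected AR.
    ivt:     2D list of IVT values (kg m-1 s-1) with the same shape.
    Returns (is_event, intensity); intensity is None when there is no event.
    Assumption: every grid cell counts equally (no area weighting).
    """
    cells = [(m, v) for mask_row, ivt_row in zip(ar_mask, ivt)
             for m, v in zip(mask_row, ivt_row)]
    ar_values = [v for m, v in cells if m]
    fraction = len(ar_values) / len(cells)
    # The text requires MORE than 50% coverage, so exactly 50% is not an event.
    if fraction <= min_fraction:
        return False, None
    # Assumption: event intensity is the mean IVT over the AR cells in the region.
    return True, sum(ar_values) / len(ar_values)
```

With this reading, a region that is exactly half covered by an AR is not counted as an event, because the text asks for more than 50% coverage.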

AR Detection Algorithm
ARs are defined as plumes of abnormally enhanced water vapor transport. We use an AR detection algorithm similar to those of previous studies (Guan & Waliser, 2015; Liang & Yong, 2021; Mundhenk et al., 2016), but with some simplifications. The ARs over 40°E-180°E, 20°S-60°N (94 × 54 grids) are detected by the following steps:

(1) Calculating the IVT using the reanalysis and forecast data. The IVT is defined as:

$$\mathrm{IVT} = \frac{1}{g}\sqrt{\left(\int_{p_s}^{p_t} q\,u\,\mathrm{d}p\right)^{2} + \left(\int_{p_s}^{p_t} q\,v\,\mathrm{d}p\right)^{2}} \tag{1}$$

where g is gravitational acceleration, q, u, and v are the specific humidity, zonal wind, and meridional wind, p is atmospheric pressure in p-coordinate, p_s is the surface pressure, and p_t is the pressure of the top atmospheric layer, here 300 hPa.

(2) Scanning for the grids with IVT greater than 500 kg m−1 s−1. In this study, an instantaneous absolute IVT threshold of 500 kg m−1 s−1 is preferred over a relative threshold such as those of Guan and Waliser (2015) and Pan and Lu (2019). Because the TIGGE forecast products start from 2008, a relative threshold consistent with the reanalysis data cannot be calculated for them, which would introduce uncertainty into the results of different models. Moreover, 500 kg m−1 s−1 is considered to be the lower bound for moderate to strong ARs and accounts for a significant portion of the AR-related hazards (Liang & Yong, 2021; Mahoney et al., 2016; Nardi et al., 2018; Reid et al., 2020).

(3) Detecting and isolating continuous regions of the identified grids, and calculating the axis length and width of each isolated region. The axis length of each isolated region must be greater than 2,000 km, and the aspect ratio greater than 2 (Guan & Waliser, 2015).

(4) If the above criteria are met, the coverage of the isolated region is defined as an AR.
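Steps (1) to (3) can be sketched as follows. This is a simplified illustration under stated assumptions, not the detection code used in the study: the trapezoidal integration over discrete pressure levels, the value g = 9.81 m s−2, the 4-connected neighbourhood, and the function names `ivt_magnitude` and `candidate_regions` are all choices made here. The axis length (> 2,000 km) and aspect ratio (> 2) checks are omitted because they depend on the geometry method of the cited studies.

```python
import math

G = 9.81  # gravitational acceleration in m s-2 (assumed value)

def ivt_magnitude(p_hpa, q, u, v):
    """Trapezoidal estimate of IVT (kg m-1 s-1) for one grid column.

    p_hpa: pressure levels in hPa, ordered from the surface up to 300 hPa.
    q: specific humidity (kg/kg); u, v: wind components (m/s) on those levels.
    """
    qu = qv = 0.0
    for k in range(len(p_hpa) - 1):
        dp = abs(p_hpa[k + 1] - p_hpa[k]) * 100.0  # layer thickness, hPa -> Pa
        qu += 0.5 * (q[k] * u[k] + q[k + 1] * u[k + 1]) * dp
        qv += 0.5 * (q[k] * v[k] + q[k + 1] * v[k + 1]) * dp
    return math.hypot(qu, qv) / G

def candidate_regions(ivt, threshold=500.0):
    """Group grid cells with IVT > threshold into 4-connected regions.

    Returns a list of regions, each a list of (row, col) index pairs.
    """
    nrow, ncol = len(ivt), len(ivt[0])
    seen = [[False] * ncol for _ in range(nrow)]
    regions = []
    for r in range(nrow):
        for c in range(ncol):
            if seen[r][c] or ivt[r][c] <= threshold:
                continue
            stack, region = [(r, c)], []
            seen[r][c] = True
            while stack:  # iterative flood fill
                y, x = stack.pop()
                region.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < nrow and 0 <= nx < ncol
                            and not seen[ny][nx] and ivt[ny][nx] > threshold):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            regions.append(region)
    return regions
```

Each region returned by `candidate_regions` would still need to pass the length and aspect ratio criteria of step (3) before being labelled an AR.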

The Proposed Model Architectures and Related Parameters
GANs (Goodfellow et al., 2020) have been successful in a wide range of applications, including weather forecast (Gong et al., 2022; Ravuri et al., 2021). The basic idea behind GANs is to train two neural networks simultaneously in a two-player minimax game framework, where the generator tries to generate realistic samples that can fool the discriminator, and the discriminator tries to distinguish between the real samples and the fake samples generated by the generator. In weather forecast, GANs can be used to generate high-quality weather forecasts by training the generator on historical weather data and using it to generate future weather predictions. The discriminator can then be used to evaluate the accuracy of the generated forecasts by comparing them with actual weather data.
However, a single GAN may not always be adequate for generating high-quality outputs that accurately represent the underlying distribution, particularly when the target distribution is highly complex and contains intricate details that are challenging to capture using a single generator. To address this limitation, researchers have proposed the use of multiple GANs that are arranged in a hierarchical fashion, referred to as "stacked GANs" or "hierarchical GANs" (Durugkar et al., 2016; Ghosh et al., 2018; Karras et al., 2017).
Drawing inspiration from stacked GANs, the present study proposes a novel GAN architecture referred to as GAN-UNet, comprising two generators and one discriminator, in order to effectively capture the fine details of the forecast results. Datasets from 1959 to 2022 over 40°E-180°E, 20°S-60°N are used for training, validation, and testing. Similar to Bi et al. (2022), one year (2003) is used for validation, three years (2008, 2013, and 2018) for testing, and the rest for training. However, unlike Bi et al. (2022), this study uses a discontinuous selection strategy, which can avoid different climate change signals due to different periods of the training and testing datasets (Chen & Wang, 2022; Tian et al., 2023). The proposed GAN-UNet framework is composed of four main components (Steps 1-4), as illustrated in Figure 1. In Step 1, generator1 is employed to generate a rough sketch of the forecasted IVTs, producing results 1. In Step 2, the discriminator is trained using gradients computed from the loss function passed to it from generator1. During training, generator1 and the discriminator play a minimax game, where generator1 aims to minimize the loss while the discriminator strives to maximize it. This minimax game is primarily reflected in the loss function, where, with each iteration of training, the sum of the losses between the results generated by generator1 and the discriminator and their respective labels becomes increasingly smaller. The specific form of the loss function is described in Equation 3 in the following text. As the training progresses, both generator1 and the discriminator become more adept at their respective tasks until the discriminator can no longer distinguish between the results 1 generated by generator1 and the actual labels. Using the trained generator1 and discriminator, we obtain results 2.
In Step 3, we refine the results 2 obtained in Step 2 by training generator2, which produces results 3. Finally, as extreme values, including maximum and minimum values, occur much less frequently in space and time, the trained network tends to underestimate the maxima while overestimating the minima (Bi et al., 2022; Chen & Wang, 2022). To address this issue, in Step 4 we fine-tune the results 3 obtained in Step 3 using a univariate cubic equation, thereby enhancing the network's performance for extreme IVT:

$$\mathrm{IVT}_t(i) = a\,\mathrm{IVT}_r(i)^{3} + b\,\mathrm{IVT}_r(i)^{2} + c\,\mathrm{IVT}_r(i) + d \tag{2}$$

where i denotes the longitude-latitude grid index; IVT_r(i) and IVT_t(i) are the results 3 (i.e., without tuning) and the tuned IVT forecast value at the ith grid. The four tunable parameters a, b, c, and d are determined from the training data output after model Steps 1-3 and the ERA5 reanalysis data for the corresponding true moment (i.e., the label).
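As a minimal sketch of Step 4, the cubic correction can be applied grid cell by grid cell. The polynomial form a·x³ + b·x² + c·x + d and the function name `tune_ivt` are assumptions made here, and the coefficients passed to the function in any example are placeholders rather than values fitted in the study.

```python
def tune_ivt(ivt_r, a, b, c, d):
    """Apply IVT_t = a*x**3 + b*x**2 + c*x + d to every untuned value x.

    ivt_r: flat list of untuned IVT forecasts (results 3), one per grid cell.
    a, b, c, d: cubic coefficients; in the study they are fitted on training
    output against ERA5 labels, here they are simply supplied by the caller.
    """
    # Horner form evaluates the cubic with three multiplications per value.
    return [((a * x + b) * x + c) * x + d for x in ivt_r]
```

Setting a = b = d = 0 and c = 1 leaves the forecast unchanged, which is a convenient sanity check before using fitted coefficients.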
The components of GAN-UNet are briefly outlined as follows. Both generators used in the model, generator1 and generator2, are U-Net models (Ronneberger et al., 2015). Similar to Tian et al. (2023), the inputs to generator1 are composed of datasets with three variables, namely IVT, zonal wind at 850 hPa (U850), and meridional wind at 850 hPa (V850). The wind fields at 850 hPa are chosen because they offer a reliable representative measure of the AR winds, as 850 hPa corresponds closely to the central altitude at which the low-level jet stream is typically situated. At the same time, additional atmospheric variables at other levels in the atmosphere (e.g., 500 hPa) may not bring any significant improvement because the information from these upper-level fields (e.g., jet variability) is indirectly captured in the lower-level atmospheric fields to a sufficient degree (). Before entering the model, each variable is first standardized to: (a) eliminate the influence of different variable scales; (b) reduce the impact of extreme values; (c) accelerate gradient descent toward the optimal solution. The results 1-3 and the final results are composed of datasets with only one variable (IVT, U850, or V850) for future time steps. It is noteworthy that the inputs of GAN-UNet are the data at time t, while the results represent the forecasted variable for time t+1, t+2, or t+3. Following Bi et al. (2022), we train three individual models for 6-hr, 12-hr, and 18-hr forecasts, respectively, and iterate them to generate the forecasts for subsequent moments, which is called hierarchical temporal aggregation. For example, when we need to forecast AR events 6 hr ahead, we use the model designed for the 6-hr forecast. However, when forecasting AR events 24 hr ahead, we first forecast the state 6 hr ahead and then input this result into the model designed for the 18-hr forecast. When forecasting AR events 48 hr ahead, we input the data twice into the model designed for the 18-hr forecast and then input the result into the model designed for the 12-hr forecast. The discriminator used in the model is a variation of the traditional convolutional neural network model.
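The hierarchical temporal aggregation described above can be sketched as a greedy decomposition of the lead time into 18-, 12-, and 6-hr model calls. The function name `plan_steps` and the largest-step-first ordering are assumptions for illustration; the text applies the 6-hr model first in its 24-hr example, so only the set of steps, not their order, should be read as matching the paper.

```python
def plan_steps(lead_hours, model_steps=(18, 12, 6)):
    """Split a lead time into the fewest calls of the available models.

    lead_hours: forecast lead time in hours (a positive multiple of 6).
    model_steps: forecast horizons of the individually trained models.
    Returns the list of step lengths, largest first (ordering is assumed).
    """
    if lead_hours <= 0:
        raise ValueError("lead time must be positive")
    plan, remaining = [], lead_hours
    for step in sorted(model_steps, reverse=True):
        count, remaining = divmod(remaining, step)
        plan.extend([step] * count)
    if remaining:
        raise ValueError("lead time is not reachable with the given steps")
    return plan
```

Under this scheme a 360-hr forecast needs 20 model calls, compared with 60 calls if only a 6-hr model were iterated, which is why the strategy reduces the number of iterations over which errors can accumulate.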
The GAN-UNet model employs a multi-scale loss function that extracts hierarchical features from multiple layers of the discriminator. This loss function captures both long-range and short-range spatial relationships between pixels by utilizing features at different levels of granularity, including pixel-level features, low-level features (such as superpixels), and medium-level features (such as patches). The loss function V used in the GAN-UNet model can be defined as follows:

$$\min_{G}\max_{D} V(D, G) = \mathbb{E}_{x}\left[\log D(x)\right] + \mathbb{E}_{z}\left[\log\left(1 - D(G(z))\right)\right] \tag{3}$$

G and D represent generator1 and the discriminator, respectively. E represents the mathematical expectation. x represents the labels, and z represents the input variables of generator1. G(z) denotes the results 1 generated by generator1. D(G(z)) represents the probability that the discriminator judges its input, here the generated results, to be the labels. Generator1 aims to maximize the value of D(G(z)), while the discriminator aims to maximize D(x) and minimize D(G(z)). Therefore, the objective is to minimize V(D, G) by optimizing generator1 (min_G) and to maximize V(D, G) by optimizing the discriminator (max_D).
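To make the roles of D(x) and D(G(z)) concrete, the sketch below evaluates the standard GAN value function of Goodfellow et al. on batches of discriminator outputs. It assumes the basic form E[log D(x)] + E[log(1 − D(G(z)))] and does not reproduce the multi-scale feature terms mentioned in the text; the function name `gan_value` is hypothetical.

```python
import math

def gan_value(d_real, d_fake):
    """Batch estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on labels, each strictly in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated fields, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term
```

A discriminator that scores labels near 1 and generated fields near 0 drives the value toward 0 from below, whereas a discriminator that cannot tell them apart outputs 0.5 everywhere and yields 2·log(0.5).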
The mean squared error (MSE) is utilized as the loss function after generator2; it is calculated as the mean of the squared differences between the forecasts and the labels:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{Y}_i - Y_i\right)^{2} \tag{4}$$

where i denotes the longitude-latitude grid index, N is the number of grid points, Ŷ_i is the label, and Y_i is the forecast.
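In addition, to assess whether DL can improve the forecast of NWPs, GAN-UNet is combined with the ECMWF forecast via averaging with equal weights, which is the same approach as in Chen and Wang (2022). The combined forecast can be expressed as:

$$\mathrm{AM}_i = \frac{1}{2}\left(\mathrm{GAN\text{-}UNet}_i + \mathrm{ECMWF}_i\right) \tag{5}$$

where i denotes the longitude-latitude grid index, AM_i is the average model (AM) forecast at the ith grid, and GAN-UNet_i and ECMWF_i are the corresponding forecasts of the two models.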
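The MSE loss and the equal-weight average model can be sketched in a few lines. The function names `mse` and `average_model` and the flat-list representation of a gridded field are assumptions for illustration.

```python
def mse(forecast, label):
    """Mean squared error over all grid cells of one field."""
    return sum((y_hat - y) ** 2 for y, y_hat in zip(forecast, label)) / len(label)

def average_model(gan_unet, ecmwf):
    """Equal-weight average of the GAN-UNet and ECMWF forecasts per grid cell."""
    return [(g + e) / 2.0 for g, e in zip(gan_unet, ecmwf)]
```

For instance, if one model forecasts 560 and the other 650 kg m−1 s−1 at a cell whose label is 600, the average of 605 has a smaller error than either input, which is the kind of cancellation the AM relies on when the two models err in opposite directions.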

Performance Evaluation
The evaluation metrics used in this study focus on assessing the accuracy of GAN-UNet in forecasting the location and intensity of the AR events. The evaluation is done from three different aspects. (a) Classification performance: This aspect evaluates whether an AR event occurs in the key regions. This is a dichotomous problem, with only two outcomes: positive (an AR event is present) or negative (an AR event is absent). We use the confusion matrix to represent the classification results (Figure 2). Depending on whether an AR event occurs in the forecast and in the label, there are four possible situations: ① Hit: the AR event occurrence is forecasted correctly. ② False alarm: the model forecasts an AR event, but the label does not contain one. ③ Miss: the model does not forecast an AR event, but the label contains one. ④ Correct reject: the model correctly forecasts the absence of an AR event.
Three specific indicators, namely precision (P), recall (R), and F1 score, are calculated based on these four outcomes. P is defined with respect to the forecasts: it is the probability that an AR event actually occurs in the samples where an AR event is forecasted. That is, P = number of correctly forecasted AR events/total number of events that are forecasted as ARs. R is defined with respect to the labels: it is the probability that an AR event is forecasted in the samples where an AR event actually occurs. That is, R = number of correctly forecasted AR events/total number of the AR events. The F1 score is defined as the harmonic mean of P and R, which helps to find a balance between them. The formulas for the three indicators are shown in Figure 2; (b) Intersection over union (IOU; Tian et al., 2023): This aspect measures the degree of overlap between the forecasted and labeled AR events in the key regions. IOU is defined as the ratio of the intersection area between the forecast and the label to the union area of the two; (c) Intensity difference: This aspect focuses on the differences in intensity between coincident AR events of the forecasts and the labels. This evaluation is only done for the hit outcomes from the classification performance evaluation.
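The classification and overlap metrics can be computed as in the sketch below. The function names, the zero fallback for empty denominators, and the unweighted cell count used for the IOU areas are assumptions for illustration.

```python
def precision_recall_f1(hits, false_alarms, misses):
    """P, R, and F1 from confusion-matrix counts (correct rejects not needed)."""
    forecast_events = hits + false_alarms
    actual_events = hits + misses
    p = hits / forecast_events if forecast_events else 0.0
    r = hits / actual_events if actual_events else 0.0
    f1 = 2.0 * p * r / (p + r) if (p + r) else 0.0  # harmonic mean of P and R
    return p, r, f1

def iou(forecast_mask, label_mask):
    """Intersection over union of two boolean AR masks of equal shape."""
    pairs = [(f, l) for f_row, l_row in zip(forecast_mask, label_mask)
             for f, l in zip(f_row, l_row)]
    intersection = sum(1 for f, l in pairs if f and l)
    union = sum(1 for f, l in pairs if f or l)
    return intersection / union if union else 0.0
```

With 8 hits, 2 false alarms, and 2 misses, all three scores equal 0.8; two masks that share one cell out of three covered cells give an IOU of 1/3.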

The Importance of ARs to EP
This section investigates the differential impacts of ARs in MP and EP moments within East Asia, as well as the ratios of the AR events across different precipitation bins. The analysis is based on Figures 3a and 3b, which reveal that AR frequency and intensity are highest in the two terrestrial key regions within East Asia. Specifically, AR frequency is considerably higher in EP moments than in MP moments, particularly in these two key regions. This finding aligns with previous studies (Guan et al., 2023; Kim et al., 2021; Liang & Yong, 2021; Ralph et al., 2006; Slinskey et al., 2020; Waliser & Guan, 2017). Similarly, the AR intensity is also stronger in EP moments than in MP moments. These results confirm the close association between AR occurrence and EP.
Furthermore, Figures 3c and 3d demonstrate that the AR-related precipitation ratios in EP moments are significantly higher than in MP moments in the two key regions. Specifically, for MP moments, the area-averaged ratios of AR-related precipitation are 22.5% and 39.8% in EC and KJ, respectively. However, for EP moments, ARs are associated with 40.5% and 60.6% of total precipitation. This underscores the greater impact of ARs on precipitation in EP moments relative to MP moments within East Asia.
To further support the notion that ARs are associated with EP, the study investigates the ratios of AR events across different precipitation bins (Figure 3e). The total numbers of AR events occurring in EC and KJ between 1959 and 2022 are 2,441 and 3,370, respectively (not shown in the figure). As the precipitation intensity increases, the ratios of the AR events increase. Overall, the ratios are 1.4% in EC and 0.1% in KJ in dry moments, 7.4% and 7.5% in MP moments (calculated as the mean values of the ratios from <10% to 80%-90%), and 32.2% and 31.9% in EP moments, respectively. In summary, this section provides insights into the impacts of ARs in MP and EP moments within East Asia, and highlights the important impact of ARs in EP moments in EC and KJ. It implies that effective AR forecast is meaningful for EP. Therefore, the subsequent sections assess the forecast performance of GAN-UNet and AM and compare it with the outcomes of NWPs to establish their efficacy.

Forecast AR Event Occurrence
The remaining analyses target the questions related to the occurrence of the AR events and to the location and intensity differences between the forecasts and the labels, all based on the data from 2008, 2013, and 2018 (the testing datasets as shown in Figure 1). Since there are 60 forecast steps in total, the forecast results are resampled to a daily temporal resolution to make the presentation of the results more concise.
The first question addressed is how frequently an AR event is correctly forecasted when one actually occurs. Figure 4 shows the scoring results of R, P, and F1 score. Overall, GAN-UNet and AM forecast the likelihood of the AR event occurrence well (scores basically greater than 0.5) within 5 days, and the forecast performance gradually deteriorates with lead time, similar to the NWPs. For convenience, we refer to the lead times of the first 5 days as "the early lead times" and the lead times after 5 days as "the late lead times". In East Asia, at the early lead times, the forecast performance of AM is the best, and the score of GAN-UNet is slightly lower than that of ECMWF but better than those of NCEP and ECCC. However, the R, P, and F1 scores of all the models in this study are inadequate in both EC and KJ at the late lead times. The detailed scores in the two key regions are given below.
In EC, AM always performs best at the early lead times, followed by GAN-UNet and ECMWF. The 5-day average F1 scores at the early lead times for the 5 models from high to low are 0.755 (AM), 0.712 (ECMWF), 0.712 (GAN-UNet), 0.7 (NCEP), and 0.632 (ECCC). By analyzing the spatial images (not shown), it is observed that the AR events incorrectly forecasted by ECMWF and GAN-UNet tend to differ at the early lead times and that this difference is attributable to minor deviations in IVT. AM mitigates the deviations by computing their mean values, which enhances the overall forecast accuracy. At the late lead times, both AM and GAN-UNet perform notably better than NWPs, but the performance of AM is not as good as that of GAN-UNet. We think that GAN-UNet performs significantly better at the late lead times mainly because of the hierarchical temporal aggregation strategy, which effectively limits the accumulation of errors in the iterative forecasting process (Bi et al., 2022). Meanwhile, as the average of the GAN-UNet and ECMWF forecasts, the performance of AM falls between the two. As shown in Figure 4e, the inferior performance of AM compared to GAN-UNet could be attributed to the F1 score of ECMWF falling below 0.5, indicating that ECMWF's forecast may have a notable error. At the same time, while GAN-UNet performs best after 5 days, its F1 score is still below 0.5, suggesting that after 5 days, none of the models could maintain accuracy in EC. In KJ, all the models perform better than in EC at all lead times. AM performs best at the early lead times, and its performance is on par with GAN-UNet at the late lead times; both outperform the NWP models selected in this study. This may be because, even when the ECMWF judgment is incorrect, the IVT differences are still within an acceptable range, so the difference can be reduced by averaging with GAN-UNet at all lead times. The 5-day average F1 scores at the early lead times for the 5 models from high to low are 0.845 (AM), 0.794 (ECMWF), 0.791 (GAN-UNet), 0.71 (ECCC), and 0.698 (NCEP). As the lead time increases, the R, P, and F1 scores decrease in KJ.
Overall, these results indicate that GAN-UNet and AM can accurately forecast the occurrence of AR events at the early lead times, with AM performing best among all the 5 models in this study. At the late lead times, however, none of the 5 models can maintain high accuracy in forecasting the AR event occurrence.

Forecast AR Event Position
To further illustrate the network forecast capability, we next consider the overlap between the forecasted and labeled AR events occurring in the key regions. Recalling the occurrence-based outcomes, a model may forecast an AR event correctly for a given moment, but its location may differ from the label. To better quantify the error in the location of the AR events, IOU is used to evaluate the degree to which the forecasts and the labels coincide for those AR events that are correctly forecasted.
As presented in Figure 5, at the early lead times, the 5-day average IOUs of the 5 models from high to low are 0.706 (AM), 0.675 (ECMWF), 0.645 (GAN-UNet), 0.603 (NCEP), and 0.555 (ECCC) in EC; and 0.688 (AM), 0.640 (ECMWF), 0.613 (GAN-UNet), 0.606 (NCEP), and 0.544 (ECCC) in KJ. The results indicate that AM outperforms the other models, with ECMWF and GAN-UNet following closely behind, while NCEP and ECCC exhibit relatively poor performance at the early lead times. Similar to the occurrence of the AR events, as the lead time increases, none of the models can maintain high IOUs, most of which are below 0.5.

Forecast AR Event Intensity
The third crucial aspect of effective AR event forecast is the correct forecast of the AR event intensity. Even if a model can accurately forecast the AR event location, a large error in the intensity may still occur. Figure 6 evaluates the AR event intensity difference between the forecasts and the labels for various lead times. The radar chart facilitates the assessment of the forecast skill for individual lead times, with a polygon's perimeter closer to 0 (black bold curve) indicating better performance.
To quantify the intensity difference between the forecasts and the labels without letting positive and negative differences cancel out when averaging over multiple lead times, the mean absolute value of the intensity difference at the early (late) lead times is calculated. For example, assuming that during the lead times of 1-5 days the intensity differences of ECMWF are 0, 15, −15, 10, and −10, its average intensity difference is 0, whereas its mean absolute value is 10. The mean absolute values of the intensity difference of the 5 models at the early lead times are 12.219 (ECMWF), 3.612 (NCEP), 25.011 (ECCC), 36.373 (GAN-UNet), and 39.645 (AM) in EC; and 11.944 (ECMWF), 13.088 (NCEP), 11.348 (ECCC), 48.473 (GAN-UNet), and 49.682 (AM) in KJ. At the late lead times, the values are 27.586 (ECMWF), 13.836 (NCEP), 44.828 (ECCC), 58.962 (GAN-UNet), and 100.417 (AM) in EC; and 16.609 (ECMWF), 20.269 (NCEP), 18.785 (ECCC), 65.842 (GAN-UNet), and 99.650 (AM) in KJ. It is evident that GAN-UNet and AM show poorer forecast results than the other models, with a severe underestimation of the labels' intensity. As the lead time increases, the AM forecasts become increasingly underestimated, making it the worst-performing model. As mentioned earlier, errors in the models are likely to be larger at extreme values, including the maximum and minimum. However, since the low values of IVT are removed in the process of AR detection (only IVT > 500 kg m−1 s−1 is considered), the largest IVT values may have the most significant impact on the AR event forecast. In other words, underestimated maximum values are an important reason for the underestimated AR event intensity. To test this hypothesis, we divide all pixels of correctly forecasted AR events into different intensity bins to compare the difference between the forecasts and the labels, as shown in Figures 7 and 8.
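The reason for using the mean absolute value rather than the plain mean can be checked with a short sketch. The signed differences below (0, 15, −15, 10, −10) are illustrative values chosen here so that positive and negative errors cancel; the function names are hypothetical.

```python
def mean(values):
    """Arithmetic mean; positive and negative differences can cancel."""
    return sum(values) / len(values)

def mean_abs(values):
    """Mean of absolute values; cancellation is not possible."""
    return sum(abs(v) for v in values) / len(values)

# Illustrative daily intensity differences for lead times of 1-5 days.
differences = [0, 15, -15, 10, -10]
signed_average = mean(differences)        # cancellation hides the errors
absolute_average = mean_abs(differences)  # typical error magnitude
```

Here the signed average is 0 even though four of the five days have a nonzero error, while the mean absolute value of 10 reflects the typical size of the error.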
To investigate the errors of AR event intensity across varying forecast lead times, the initial, middle, and final lead times, namely 1, 7, and 15 days, are chosen. As shown in Figure 7, at 1-day lead time, the difference between the forecasts and the labels across different IVT bins is relatively small, especially in the case of ECMWF and AM. The box plot length of AM is smaller than that of ECMWF, indicating that the AM forecasts at all the grids are closer to the labels. However, in the maximum IVT bin, the forecast value of AM is low, which may be attributable to GAN-UNet's underestimation of the final bin. This finding helps to explain why, at the early lead times, the average IVT forecast values for GAN-UNet and AM are lower than the labels (Figure 6). NCEP forecasts are higher for small IVT bins and lower for large bins, accounting for its small total IVT differences. Conversely, ECCC forecasts for all bins are generally higher, consistent with the larger forecasts reported in Figure 6.
At the 7- or 15-day lead time, all the 5 models show larger differences across all bins, in both the mean values and the standard deviation. For ECMWF, the mean values of the forecasts are higher for small IVT bins and lower for large bins, resulting in small total IVT differences (Figures 6a and 6f). The forecasts of GAN-UNet and AM in different bins are similar to those of ECMWF, but the standard deviation is smaller and the underestimation of the largest bin is more pronounced. This could be the reason for their poor performance in Figure 6. Moreover, in the minimum IVT bin, AM has the lowest difference, perhaps explaining why it is the best at forecasting AR event occurrence and position. In addition, the box plot length of AM is the smallest in all bins, indicating the smallest fluctuation of the differences. For NCEP and ECCC, the forecasts are higher than or equal to the labels for all bins, and the differences of the means of the two models decrease with increasing IVT intensity, except for the largest bin, where the forecasts are lower than the labels. The models in KJ behave similarly to those in EC, except that the box plot lengths are larger, indicating greater variability in forecast-label differences across bins.
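The binned comparison discussed above can be sketched as follows. This is a minimal illustration, not the evaluation code used in the study: the function name is hypothetical, pixels are passed as flat lists, and the intermediate bin edges (600 and 700 kg m−1 s−1) are assumptions. Only the 500 kg m−1 s−1 detection threshold and the final bin above 825 kg m−1 s−1 come from the text.

```python
from bisect import bisect_left


def mean_difference_by_bin(forecast, label, lower_edges=(500, 600, 700, 825)):
    """Mean (forecast - label) per IVT bin, with pixels binned by label IVT.

    lower_edges holds ascending exclusive lower bounds: bin i covers
    lower_edges[i] < IVT <= lower_edges[i + 1], and the last bin is open-ended.
    Pixels at or below the first edge are skipped. Empty bins yield None.
    """
    sums = [0.0] * len(lower_edges)
    counts = [0] * len(lower_edges)
    for f, l in zip(forecast, label):
        i = bisect_left(lower_edges, l) - 1  # index of the bin containing l
        if i < 0:
            continue  # label IVT does not exceed the detection threshold
        sums[i] += f - l
        counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical pixel values (kg m-1 s-1): forecast first, label second.
forecast_ivt = [520, 640, 800, 470]
label_ivt = [510, 650, 900, 450]
print(mean_difference_by_bin(forecast_ivt, label_ivt))
```

Binning by the label rather than by the forecast keeps each bin tied to the observed intensity, so a negative mean in the last bin reads directly as underestimation of the strongest IVT.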
To further verify that the poor performance of GAN-UNet and AM in Figure 6 is mainly due to their low forecasts in the final bin, we conduct a sensitivity experiment in which all the forecast and label grids are replaced with 825 kg m−1 s−1 wherever IVT > 825 kg m−1 s−1 at the corresponding grids in the labels. The results in Figure 9 show that the GAN-UNet forecast is slightly larger, the errors of AM are minimal, and the forecasts of ECMWF, NCEP, and ECCC are all on the large side, which confirms the results in Figures 7 and 8. Based on the preceding analysis, it can be inferred that even though GAN-UNet and AM forecast the AR event occurrence and location accurately, their forecasts of the maximum values are severely underestimated. In summary, as lead times advance, the intensity differences between the forecasts and the labels of all the 5 models increase. Additionally, each model exhibits unique patterns. Notably, the AM mean values of the differences are the smallest, while those of GAN-UNet are slightly greater than those of ECMWF but smaller than those of NCEP and ECCC, with the exception of the final bin.
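The replacement rule of the sensitivity experiment can be sketched as below. This is an illustration under stated assumptions: the function name is hypothetical and the grids are flattened into lists. Note that the condition is evaluated on the label only, so a forecast above 825 kg m−1 s−1 at a grid where the label is lower is left unchanged.

```python
def cap_where_label_exceeds(forecast, label, cap=825.0):
    """Replace forecast and label with `cap` wherever the label exceeds `cap`."""
    capped_forecast = []
    capped_label = []
    for f, l in zip(forecast, label):
        if l > cap:
            # The label is in the final bin: both grids take the fixed value.
            capped_forecast.append(cap)
            capped_label.append(cap)
        else:
            capped_forecast.append(f)
            capped_label.append(l)
    return capped_forecast, capped_label


# Hypothetical grid values (kg m-1 s-1).
forecast_ivt = [600.0, 700.0, 900.0]
label_ivt = [650.0, 900.0, 800.0]
print(cap_where_label_exceeds(forecast_ivt, label_ivt))
```

After the replacement, grids in the final bin contribute a difference of zero, so any remaining intensity difference comes from the lower bins.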
It is worth mentioning that, compared to the mean of the differences, RMSE does have some advantages in evaluating the gap between forecasts and labels. The major distinction is that, if only the mean of the differences is used to assess the disparity among multiple data points, positive and negative errors may offset each other. However, as illustrated in Figures 7 and 8, the mean of the differences can show whether the overall forecast values are biased high or low relative to the labels, which is precisely what we aim to convey. To demonstrate that small differences are not due to the cancellation of positive and negative errors, we have calculated the RMSE, and the corresponding figures are provided in Figures S1 and S2 in Supporting Information S1.
In summary, the forecast results for AR occurrence accuracy, AR shape, and intensity show that AM has the best forecast skill for AR events, followed by ECMWF, GAN-UNet, NCEP, and ECCC. One potential reason why NCEP and ECCC perform worse than ECMWF could be the observation gaps for ARs. NCEP GFS assimilates less data than ECMWF does, leading to poorer initial conditions. Such a deficiency in the initial analysis can translate into poor forecast skill (Geer et al., 2018; Zheng et al., 2021).

Cases of the Forecast Results
To demonstrate the forecasts more intuitively and confirm their effectiveness, Figures 10 and 11 present two cases of AR event forecasts generated by the different forecast models in EC and KJ, respectively. The selection is subject to the following two criteria. First, we exclude uncontroversial AR events, that is, events that all 5 models in this study forecast well, because we mainly want to show the variability of the models' forecasts of controversial events. Second, we deliberately select two cases where GAN-UNet performs better. By examining the forecasted events, we conclude that among controversial events, GAN-UNet tends to raise false alarms for AR events that do not occur, while ECMWF tends to miss events that do occur. In general, a miss is more harmful than a false alarm. To highlight the advantages of AM and GAN-UNet, we select two events that occurred to demonstrate the performance of our proposed method. In the second case, AM and GAN-UNet correctly forecast the occurrence, while both ECMWF and NCEP miss it at the first lead time.
Both figures indicate that the uncertainty and variability of the forecasts increase as the lead times progress. Specifically, in Figure 10, all forecast results display an IOU above 0.8 when the lead time is 1 day, indicating a relatively accurate forecast of the AR event. The IOUs, ranked from largest to smallest, are AM (0.87), ECMWF (0.86), NCEP (0.844), ECCC (0.838), and GAN-UNet (0.81). The spatial maps also show a good forecast of the AR event intensity. However, when the lead time extends to 4 days, the IOUs of all the 5 models drop to approximately 0.6, while the performance of AM is still the best. When the lead time is 7 days, although all the 5 models forecast the AR event occurrence correctly, the results of all the models except GAN-UNet differ significantly from the label, with the shape and intensity of the AR event shifting. Although some models forecast the event at lead times of 10-15 days, the shapes differ markedly from the label and can be regarded as random results. Figure 11 depicts the AR event occurring in KJ. When the lead time is 1 day, AM, GAN-UNet, and ECCC correctly forecast the event, with AM having the highest IOU (0.897), followed by GAN-UNet (0.86) and ECCC (0.827). ECMWF and NCEP fail to forecast the AR event. This may be attributed to the definition of the AR event, that is, the AR event needs to cover more than 50% of the key region. Specifically, ECMWF and NCEP may forecast the occurrence of the AR event in East Asia, but since it covers less than 50% of the area of KJ, we consider the AR event not to occur there. When the lead time is 4 days, all 5 models forecast the AR event accurately, with AM performing best (0.799), followed by GAN-UNet (0.738), NCEP (0.701), ECCC (0.682), and ECMWF (0.668). Overall, the IOUs obtained by different models fluctuate slightly in a single case, with GAN-UNet and AM consistently demonstrating superior performance, frequently surpassing the state-of-the-art NWP model, especially at the early lead times. However, the reliability of all the 5 models deteriorates as the lead time increases.
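The two quantities used in these case studies, the IOU between forecast and label and the 50% coverage criterion for an AR event in a key region, can be sketched as follows. This is a minimal illustration assuming binary AR masks flattened into lists; the function names and the example masks are hypothetical and do not reproduce the detection pipeline of the study.

```python
def iou(forecast_mask, label_mask):
    """Intersection over union of two binary AR masks of equal length."""
    intersection = sum(1 for f, l in zip(forecast_mask, label_mask) if f and l)
    union = sum(1 for f, l in zip(forecast_mask, label_mask) if f or l)
    return intersection / union if union else 0.0


def event_occurs(ar_mask, region_mask, min_fraction=0.5):
    """True if the AR covers more than `min_fraction` of the key region."""
    region_size = sum(1 for r in region_mask if r)
    covered = sum(1 for a, r in zip(ar_mask, region_mask) if a and r)
    return region_size > 0 and covered / region_size > min_fraction


# Hypothetical masks over six grid points; the key region is the first four.
region = [1, 1, 1, 1, 0, 0]
label = [1, 1, 1, 0, 0, 0]      # covers 3 of 4 region points: event occurs
forecast = [1, 1, 0, 0, 1, 0]   # covers 2 of 4 region points: counted as a miss

print(iou(forecast, label), event_occurs(label, region), event_occurs(forecast, region))
```

The example shows how a forecast can overlap the label substantially and still count as a miss, because coverage of exactly half the key region does not exceed the threshold.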
From these two cases, it can be seen that the forecast performance of GAN-UNet is comparable to that of NWPs, and the performance of AM is the best, particularly at the early lead times, which is consistent with the results presented earlier in this paper.

Discussion and Conclusions
In this paper, we have developed a DL model (GAN-UNet) to forecast AR events by forecasting the full-field IVT of East Asia with three inputs (IVT, U850, and V850). We make two key improvements to the model architecture: (a) we develop stacked GANs consisting of two generators and a discriminator to better capture the details of the forecasts, and (b) we use a strategy called hierarchical temporal aggregation to avoid error accumulation over multiple forecast steps. In addition, to better use DL to improve the forecast accuracy of NWP, we compute a weighted average of the results of GAN-UNet and ECMWF and name it AM.
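The combination of GAN-UNet and ECMWF into AM can be sketched as a pointwise weighted average. This is only an illustration: the function name is hypothetical, the fields are flattened into lists, and the default weight of 0.5 is an assumption, since the weights are not stated in this section.

```python
def average_model(gan_unet_ivt, ecmwf_ivt, weight_gan=0.5):
    """Pointwise weighted average of two IVT fields of equal length.

    weight_gan is the weight given to GAN-UNet; ECMWF receives 1 - weight_gan.
    The default of 0.5 is an assumed value for illustration.
    """
    if not 0.0 <= weight_gan <= 1.0:
        raise ValueError("weight_gan must lie in [0, 1]")
    if len(gan_unet_ivt) != len(ecmwf_ivt):
        raise ValueError("both fields must have the same number of grid points")
    return [
        weight_gan * g + (1.0 - weight_gan) * e
        for g, e in zip(gan_unet_ivt, ecmwf_ivt)
    ]


# Hypothetical IVT values (kg m-1 s-1) at two grid points.
print(average_model([600.0, 800.0], [700.0, 900.0]))
```

Exposing the weight as a parameter makes it straightforward to explore other weighting strategies, for example weights that vary with lead time.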
GAN-UNet successfully forecasts the occurrence, position, and intensity (except for the largest IVT bin) of the AR events in EC and KJ, with accuracy comparable to the state-of-the-art NWP models archived in TIGGE. Furthermore, for the forecast of the AR event occurrence and position, AM outperforms all the models used in this study during the early lead times, indicating that GAN-UNet can improve the results of the NWP model. Specifically, at the early lead times, in the occurrence and position forecast, AM performs best, and GAN-UNet outperforms NCEP and ECCC but slightly lags behind ECMWF. At the late lead times, both GAN-UNet and AM are notably better than the NWP models, although the results do not maintain sufficient accuracy. For the AR event intensity forecast, we find that in all the 5 models, the forecasts for small values are overestimated and the forecasts for large values are underestimated. However, GAN-UNet and AM forecasts show greater consistency and accuracy across all bins, except for the largest bin, where the differences are high for all the 5 models. After replacing the values of the largest bin with a fixed value, GAN-UNet and AM have the smallest differences in their intensity forecasts compared to the labels.
In general, DL presents several advantages over NWPs. Specifically, (a) the trained DL model can emulate specific modules or processes of NWP models, enhancing accuracy and timeliness. (b) Detection of extreme cases holds paramount importance for disaster prevention and emergency decision-making. Data-driven methods within DL can provide forecasts within minutes of receiving new data, potentially better suiting the requirements of highly responsive forecasting services compared to traditional theory-driven NWPs. (c) Supervised and semisupervised DL can overcome the limitations of threshold-based conventional detection approaches (Ren et al., 2021; Tian et al., 2023). However, DL has some well-recognized drawbacks, such as poor interpretability and a lack of physical constraints, areas currently receiving focused attention in the DL community. Additionally, correcting or improving forecasts of extreme cases with machine learning and DL methods poses challenges, as evident in existing research (Johnson & Khoshgoftaar, 2019; Moniz et al., 2018; Ribeiro & Moniz, 2020). The fundamental challenge in DL-based forecasting of extreme cases arises from the rarity of such cases, a phenomenon known as class imbalance, which is one of the major challenges in machine learning and DL. Highly imbalanced data introduce complexity, as most learners tend to exhibit bias toward the majority class, potentially overlooking the minority class, especially in extreme cases.
This study could be improved in a number of ways. First, while we have identified that AM outperforms ECMWF and GAN-UNet in forecasting the occurrence and position of AR events, we have not quantitatively analyzed the reasons for this phenomenon. Given the varying gaps between GAN-UNet and ECMWF forecasts and the labels, it may be beneficial to consider different weighting strategies. Second, all the 5 models exhibit poor performance in forecasting the maximum values of AR events, which are often associated with severe precipitation and have significant implications for human activities. This may be attributed to the high imbalance of the input data and to a widely discussed limitation of neural networks, spectral bias (Mojgani et al., 2023). Future studies may consider incorporating algorithms for forecasting extremely rare events (Mojgani et al., 2023; Pickering et al., 2022) to improve the forecast of maximum values in AR events.
Future work is expected to address the challenges of accurately forecasting the IVT maxima in AR events, which are a common difficulty for most models. We emphasize the importance of accurate forecasting of extreme values for preventing major disasters. In addition to AR events, we are also concerned with the forecast of AR-related precipitation. In subsequent work, we hope to obtain better forecasts of AR-related precipitation and check whether they are highly correlated with AR intensity. Additionally, future work will consider post-processing or downscaling of existing forecast data, which would have important implications for disaster prevention and management.

Figure 1. The overall forecast process of the GAN-UNet model. GAN-UNet contains three input and output layers (i.e., IVT, U850, and V850). Since the inputs of a single model are all three variables and the output is one variable, the three variables must be forecasted separately by running the model once for each variable.

Figure 2. Confusion matrix of the classification results and the expressions of P, R, and F1 score.

Figure 3. Seasonal-mean (MJJA) climatology of AR frequency (%, shaded), IVT (kg m−1 s−1, contoured), and IVT vectors (kg m−1 s−1, vectors) in (a) MP moments and (b) EP moments; (c)-(d) are the same as (a)-(b), but for the ratio (%, shaded) of AR-related precipitation to total precipitation; (e) the ratio of the AR events in different precipitation bins in EC and KJ. The reference domains for EC and KJ are boxed in (a)-(d).

Figure 5. Mean IOUs at (a) EC and (b) KJ with lead times for all the selected models.Shading indicates the 95% confidence intervals of the IOUs.

Figure 6. The radar charts of the AR intensity difference (kg m−1 s−1) between forecasts based on (a), (f) ECMWF, (b), (g) NCEP, (c), (h) ECCC, (d), (i) GAN-UNet, and (e), (j) AM and the labels over EC and KJ. The numbers outside the circles represent lead times (day). The numbers inside the circles represent the value of AR intensity difference corresponding to each circle. The black bold curve represents a value of AR intensity difference of zero.

Figure 7. Box-and-whisker plots binned by IVT (kg m−1 s−1) of (a-c) ECMWF, (d-f) NCEP, (g-i) ECCC, (j-l) GAN-UNet, (m-o) AM at lead time = 1, 7, and 15 days in EC. The box covers the interquartile range, the black circle inside the box is the mean value, and the whiskers indicate the range (i.e., less than 1.5 times the interquartile range above or below the upper or lower quartile, respectively). The mean values of the initial, middle, and final bins are represented by black values.

Figure 8. As in Figure 7 but in KJ.

Figure 9. As in Figure 6, but all the forecast and label grids are replaced with 825 kg m−1 s−1 when IVT > 825 kg m−1 s−1 at the corresponding grids in the labels.

Figure 10. Case study for 02 Jul 2013 1200 UTC with one AR event occurring in EC with (a) the label, and forecasts of (b) ECMWF, (c) NCEP, (d) ECCC, (e) GAN-UNet, and (f) AM. Each column represents the results generated 1 day, 4 days, 7 days, 10 days, 13 days, and 15 days before the occurrence of this case. For example, the second column in (b) represents the forecast result obtained with the initial forecast time four days before 02 Jul 2013 1200 UTC (i.e., 30 May 2013 1200 UTC).

Figure 11. As in Figure 10, but for the case study for 27 Jun 2018 1200 UTC with one AR event occurring in KJ.

Table 1
Categories of Precipitation Days