Global Precipitation Correction Across a Range of Climates Using CycleGAN

Accurate precipitation simulations for various climate scenarios are critical for understanding and predicting the impacts of climate change. This study employs a Cycle‐generative adversarial network (CycleGAN) to improve global 3‐hr‐average precipitation fields predicted by a coarse grid (200 km) atmospheric model across a range of climates, morphing them to match their statistical properties with those of reference fine‐grid (25 km) simulations. We evaluate its performance on both the target climates and an independent ramped‐SST simulation. The translated precipitation fields remove most of the biases simulated by the coarse‐grid model in the mean precipitation climatology, the cumulative distribution function of 3‐hourly precipitation, and the diurnal cycle of precipitation over land. These results highlight the potential of CycleGAN as a powerful tool for bias correction in climate change simulations, paving the way for more reliable predictions of precipitation patterns across a wide range of climates.


Introduction
Throughout the history of atmospheric model development, results from fine-grid models that resolve important physical processes like cloud and precipitation formation or flow over mountain ranges have been used to improve biased climates in coarse-grid models that do not. For instance, scientists have used large-eddy simulations as a test-bed for calibrating analytic turbulence and cloud parameterizations (e.g., Bogenschutz et al., 2010). This process relies heavily on expert knowledge to develop appropriate models of sub-grid behaviors, often heavily influenced by analysis of a few archetypical cases.
More recently, machine learning (ML) has been used to correct coarse-resolution models' behavior across the full range of conditions over historical periods where observational analysis is available. For example, Watt-Meyer et al. (2021) trained a corrective tendency for a 200 km grid atmospheric general circulation model (AGCM) using ML based on nudging tendencies toward observational analysis. The ML correction reduced annual-mean precipitation biases by 20%. An ML approach based on reservoir computing (Arcomano et al., 2023) and ERA5 reanalysis (Hersbach et al., 2020) halved the global root mean square bias of annual-mean precipitation in an even coarser (400 km grid) AGCM.
We require a different strategy when training a model to generalize across future climate forcings, as when training with observational analyses one can only learn the climate represented by this data. One method is to use finer-grid AGCM simulations as training targets. Such simulations are computationally expensive, but they more accurately simulate societally-important aspects of present-day climate, such as means and extremes of land surface precipitation and temperature, than do coarse-grid AGCMs (Flato et al., 2013; Wehner et al., 2010). Because they resolve much more detail of deep convective storm systems, orography and land surface characteristics, they are less sensitive to uncertain parameterizations of deep convection and orographic drag, making them potentially a more robust simulation tool for generalizing to future climates. S. K. Clark et al. (2022) used the same nudging approach as Watt-Meyer et al. (2021) to ML-correct a 200 km model to behave like its 25 km analog across four climates forced by adding specified uniform sea-surface temperature (SST) increments to observed SST patterns. They were able to correct spatial patterns of precipitation over land by 10%-30% in multiyear simulations across all four climates. This bias reduction is encouraging, but to get full advantage from high-fidelity reference data, corrective ML should enable both the weather and climate to have much reduced bias (much less than 50% of a no-ML baseline simulation) versus this reference, both for means and extremes of salient quantities such as precipitation. Yet fundamental challenges persist. This type of hybrid ML, which couples a bias correction model trained offline with a pre-existing AGCM, can induce online simulation biases due to feedbacks between these two components (Brenowitz et al., 2020). The machine learning goal of minimizing prediction error for each sample can lead to difficulties in accurately representing small-scale stochastic behaviors such as deep convection, leading, for example, to an inaccurate representation of the frequency of extreme precipitation (Kwa et al., 2023).
To further reduce bias in the simulated space-time distribution of precipitation versus a reference climatology, we turn to a different form of ML, the Cycle-generative adversarial network or CycleGAN (Zhu et al., 2017), which is a promising tool for translation of image data between two unpaired domains. Unlike the above hybrid ML approaches, this is a post-processing approach that cannot easily be analyzed in terms of physical process errors in the coarse-grid model, and only corrects selected model fields (precipitation, in our case). In the past, cycle-generative networks have been effectively used for offline bias correction. Several studies found it necessary to augment the cycle-generative network with quantile mapping (QM) to achieve accurate probability distributions of precipitation (François et al., 2021; Fulton et al., 2023; Pan et al., 2021), with one more recent study achieving success without quantile mapping (Hess et al., 2022). These works focused on correcting daily-mean precipitation toward an observational analysis over a historical period.
Our work expands on these efforts. We use the original CycleGAN architecture of Zhu et al. (2017) to correct the output of the FV3GFS atmospheric model with a C48 cubed-sphere grid (with approximately 200 km horizontal spacing) to behave like the coarsened output of the same model run on a C384 cubed-sphere grid (25 km spacing). We demonstrate the ability to improve both the spatial distribution of annual-mean precipitation and the cumulative distribution function (CDF) of 3-hourly precipitation up to the 99.999th percentile across a range of climate forcings without using QM. Using a high-resolution simulation as the target allows us to train on climates not seen in the historical record. This method is capable of correcting data at intermediate climate forcings not used during model training, enabling its application to climate change simulations.

Data Set
We generate all training data using the FV3GFS atmospheric model (Harris & Lin, 2013; Putman & Lin, 2007; Zhou et al., 2019) as described in McGibbon et al. (2021), run on a cubed-sphere grid with 63 vertical levels. Annually-repeating cycles of sea surface temperature (SST) and sea ice are defined based on the observational monthly means time-averaged from 1982 to 2012 from the 1/12° Real Time Global Sea Surface Temperature (Thiébaux et al., 2003) and 0.5° Climate Forecast System Reanalysis (Saha et al., 2014) data sets, respectively. We perturb the SSTs by adding globally-constant offsets of -2, 0, +2, and +4 K to produce four different sets of forcings while maintaining the present-day annual cycle of sea ice and carbon dioxide concentration, analogous to S. K. Clark et al. (2022). We train using simulations with spacing between SST offsets of 2 K out of concern that precipitation may be too different between forcings at the larger 4 K spacing used in S. K. Clark et al. (2022) for the trained model to accurately generalize to intermediate forcings, though this has not been tested.
For each of these SST forcings, a simulation was performed at C48 resolution for 9 years, 1 month. Eight 1 year, 1 month simulations are performed at C384 resolution beginning with the C48 model snapshot state 1 year into the C48 run as well as the state every year thereafter; the C48 snapshots were converted to C384 initial conditions using the chgres_cube tool of UFS_UTILS (Gayno et al., 2020). For each of these C384 simulations, the first month of simulation time is discarded as a model spin-up period. This yields 8 years of useful simulation data from each climate, from which we take the first 5 years as training and the last 3 years as validation data.
During these simulations we accumulate and store the 3-hourly mean precipitation rate. We use 3-hourly precipitation instead of daily mean to test the ability of the model to correct biases in the diurnal cycle. At each output time, the C384 precipitation fields are coarsened to the C48 grid by horizontal averaging so that they can be directly compared with coarse-grid precipitation fields.
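The coarsening step can be illustrated with a minimal sketch. It assumes a plain unweighted block mean over each 8 x 8 group of fine cells (384 / 48 = 8) on one square tile; the text says only "horizontal averaging", so any area weighting used in the actual workflow is not represented here, and the function name `coarsen_tile` is hypothetical.

```python
def coarsen_tile(fine, factor=8):
    """Block-average a square 2-D tile (list of rows) by `factor` per side.

    Assumption: every fine cell in a block gets equal weight.
    """
    n = len(fine)
    if n % factor != 0 or any(len(row) != n for row in fine):
        raise ValueError("tile must be square with side divisible by factor")
    m = n // factor
    coarse = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            total = 0.0
            for di in range(factor):
                for dj in range(factor):
                    total += fine[i * factor + di][j * factor + dj]
            coarse[i][j] = total / (factor * factor)
    return coarse


# Small demonstration: a 4x4 tile coarsened by a factor of 2.
demo = [
    [1.0, 1.0, 2.0, 2.0],
    [1.0, 1.0, 2.0, 2.0],
    [0.0, 4.0, 3.0, 3.0],
    [0.0, 4.0, 3.0, 3.0],
]
demo_coarse = coarsen_tile(demo, factor=2)
```

Note that block averaging preserves the tile-mean precipitation, so a localized fine-grid extreme is spread over the whole coarse cell.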
We also perform "ramping" simulations at both C48 and C384 resolution, which begin with a present-day initial condition and a three-month spin-up period with 0 K forcing, and then enter a period where the forcing is linearly increased from 0 K to +2 K over the course of 4 years. This data is withheld during training and hyperparameter tuning, and is used for model evaluation only. It tests whether the CycleGAN can skillfully interpolate mean and extreme precipitation patterns between climates on which it was trained.
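As a small worked illustration of the ramped forcing, the sketch below evaluates a linear ramp from 0 K to +2 K over 4 years. The function name is hypothetical, and time is assumed to be measured from the end of the spin-up period, with the offset held fixed outside the ramp.

```python
def ramp_sst_offset(years_since_ramp_start, ramp_years=4.0, final_offset_k=2.0):
    """SST offset (K) for a linear ramp, clamped to [0, final_offset_k]."""
    fraction = min(max(years_since_ramp_start / ramp_years, 0.0), 1.0)
    return final_offset_k * fraction


# Offsets at the start of each ramp year and at the end of the ramp.
offsets_by_year = [ramp_sst_offset(float(year)) for year in range(5)]
```

Under these assumptions, the second and third years of the ramp span offsets from +0.5 K to +1.5 K, which lie between the 0 K and +2 K training climates.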

Model Formulation and Training
The model architecture in Zhu et al. (2017) is used with minimal modifications to allow processing of cubed-sphere data. Specifically, convolution is performed using halo updates on the cubed sphere, where missing corners are filled with zero values. This is numerically identical to the convolution approach used in Weyn et al. (2020), except that data in the corner of each tile domain is set to zero rather than copying and rotating data from the polar tile face. We do not find evidence of corner imprinting despite this choice. See Figure 3 and Appendix 7.2 of Zhu et al. (2017) for a detailed description of the architecture, and Section 3 of Zhu et al. (2017) for a detailed description of the loss function.
The performance of this model is improved by concatenating spatiotemporal geometric features to the input of the generator and discriminator models. These features are the x, y, and z positions of each grid cell on a stationary unit sphere in 3-dimensional Euclidean space (spatial features), as well as the x and y positions of each grid cell on a unit sphere in Euclidean space as it rotates with a period of one rotation per day (time features). These time features can also be thought of as the x and y positions of an hour hand on a 24-hr clock indicating the local time, multiplied by cos(latitude) to avoid discontinuity at the poles. These are used only as inputs of these models, and are not output by the generators. The discriminator is given identical geometric features to the generator which produced the image being evaluated.
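The geometric features described above can be sketched for a single grid cell as follows. This is an illustration rather than the authors' code: the function name, the axis orientation, and the phase convention (rotation angle equal to longitude plus the UTC fraction of a day) are assumptions.

```python
import math


def geometric_features(lat_deg, lon_deg, utc_hour):
    """Return (x, y, z, x_t, y_t) for one grid cell.

    x, y, z: position on a stationary unit sphere (spatial features).
    x_t, y_t: horizontal position on a sphere rotating once per day
    (time features); both shrink to zero at the poles via cos(latitude).
    """
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    # Assumed phase convention: longitude advanced by the fraction of the day.
    phase = lon + 2.0 * math.pi * (utc_hour / 24.0)
    x_t = math.cos(lat) * math.cos(phase)
    y_t = math.cos(lat) * math.sin(phase)
    return (x, y, z, x_t, y_t)


equator_midnight = geometric_features(0.0, 0.0, 0.0)
pole_features = geometric_features(90.0, 45.0, 6.0)
```

Encoding local time as a point on a circle, rather than as an hour number, avoids the jump between 24:00 and 00:00.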
The training data set includes 58,400 3-hourly global snapshots, split evenly across the four climate forcings. Each epoch, we randomly sample 40,000 snapshots with replacement, training with a batch size of 1. This data only contains two-dimensional cubed-sphere surface precipitation rate along with a UTC time, which is used to generate the spatiotemporal geometric features.
Notably, the climate forcing itself is absent from the training data, as we were able to achieve excellent bias correction without it. When we included the SST perturbation as input context, as was done for diurnal features, several performance metrics worsened without any clear improvements (compare red vs. steel-blue colors in Figures S2 and S3 in Supporting Information S1).
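Because each epoch draws 40,000 snapshots with replacement from a pool of 58,400, a single epoch does not visit every snapshot. The short calculation below, using a hypothetical helper name and assuming uniform sampling, gives the expected fraction of distinct snapshots seen in one epoch.

```python
def expected_unique_fraction(pool_size, draws):
    """Expected fraction of distinct items seen when drawing uniformly
    with replacement `draws` times from `pool_size` items."""
    return 1.0 - (1.0 - 1.0 / pool_size) ** draws


POOL_SIZE = 58_400        # training snapshots (from the text)
DRAWS_PER_EPOCH = 40_000  # snapshots sampled per epoch (from the text)

fraction_seen = expected_unique_fraction(POOL_SIZE, DRAWS_PER_EPOCH)
```

Under this assumption roughly half of the training snapshots are seen in any one epoch, so an "epoch" here is a fixed number of training steps rather than a full pass over the data.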
The model was trained with an exponential learning rate decay. Starting with a high learning rate and reducing it later in training is a widely used technique in machine learning (Li et al., 2019). The best results shown here were achieved with an initial learning rate of 10^-4 and a decay factor of 0.63 (a tenfold decrease every five epochs). Training converged (in terms of our precipitation bias metrics) by epoch 14 and was run for 16 epochs (Figure S1 in Supporting Information S1).
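The relationship between the decay factor and the tenfold decrease can be checked with a short sketch. It assumes the factor of 0.63 is applied once per epoch and takes the initial learning rate to be 1e-4; the function name is hypothetical.

```python
def learning_rate(epoch, initial_lr=1e-4, decay=0.63):
    """Exponentially decayed learning rate at the start of `epoch` (0-based).

    Assumptions: decay applied once per epoch, initial rate of 1e-4.
    """
    return initial_lr * decay ** epoch


# Learning rate at the start of each of the 16 training epochs.
schedule = [learning_rate(epoch) for epoch in range(16)]
five_epoch_ratio = schedule[5] / schedule[0]
```

Since 0.63^5 is approximately 0.099, five epochs of decay reduce the learning rate by very nearly a factor of ten, consistent with the description in the text.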
Otherwise, the hyperparameters are the same as in the 6-layer network of Zhu et al. (2017), but with twice as many filters in the generator and discriminator. We did not attempt to tune the number of layers, activation functions, or choice of optimizer, and we found that increasing the number of filters beyond the value used did not improve the model.
Training was performed on a single Tesla T4 GPU in Google Cloud, and evaluation was performed using one CPU of an n1-highmem-8 Google Cloud VM instance. The trained model is approximately 410 MB in size. Inference speed for the CycleGAN translation was approximately 3.4 s per simulated day, while the C48 model used to generate its input took 180 s per simulated day on 24 CPU cores in Google Cloud, and the high-resolution target took 325 s per simulated day on 1,536 cores on the Gaea supercomputing system at the National Oceanic and Atmospheric Administration (NOAA) Geophysical Fluid Dynamics Laboratory (GFDL).
We also perform a comparison to a quantile mapping (QM) baseline model. We used the QuantileTransformer from scikit-learn v1.0.2 (Pedregosa et al., 2011) fit on the combined cross-climate training data. We use default settings other than to set the subsample parameter to a number greater than the total size of our training data set. We use this model to translate the C48 data into quantile values, and then translate those to their values from the C384 distribution.
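The two-step idea behind quantile mapping (value to quantile in the source distribution, then quantile to value in the target distribution) can be sketched with empirical distributions. This is a simplified, dependency-free stand-in and not the scikit-learn QuantileTransformer used in the study; the function names and the linear interpolation scheme are assumptions.

```python
import bisect


def fit_quantile_map(source_train, target_train):
    """Build a function mapping source values onto the target distribution."""
    src = sorted(source_train)
    tgt = sorted(target_train)
    if len(src) < 2 or len(tgt) < 2:
        raise ValueError("need at least two samples per data set")

    def source_quantile(value):
        # Quantile in [0, 1], interpolating between sorted source samples.
        if value <= src[0]:
            return 0.0
        if value >= src[-1]:
            return 1.0
        i = bisect.bisect_right(src, value) - 1  # src[i] <= value < src[i + 1]
        frac = (value - src[i]) / (src[i + 1] - src[i])
        return (i + frac) / (len(src) - 1)

    def transform(value):
        # Look up the same quantile in the sorted target samples.
        pos = source_quantile(value) * (len(tgt) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(tgt) - 1)
        return tgt[lo] + (pos - lo) * (tgt[hi] - tgt[lo])

    return transform


# Toy data: the target has twice the source's values at every quantile.
qm = fit_quantile_map([0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 2.0, 4.0, 6.0, 8.0])
```

Because the mapping acts on each value independently of its location and time, it can match the pooled distribution without changing where or when precipitation falls.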

Results
Figure 1 shows the behavior of the generative model on a single sample, taken from the ramping simulation in a climate state distinct from any in the CycleGAN training data set. We are most concerned with the translation of C48 data into C384 (ML) data (upper left vs. upper right panels), but it is also illuminating to see the inverse generation from C384 to C48 (ML) (lower left vs. lower right panels). The model introduces finer scale features when translating into the C384 domain, especially in lightly precipitating marine boundary layer cloud regimes. It strengthens precipitation over land, introducing precipitation into areas which have none in the C48 input, for example, over the South American continent.
The translation substantially improves the mean precipitation climatology versus C48 simulations for all four SST offsets, as shown in Figure 2, with metrics reported in Figure 3. Here and throughout this analysis, time-mean and CDF statistics for fixed-SST simulations such as for the 0 K climate are computed on the 3 years of validation data. Statistics for the ramping climate are computed on years 2 and 3 of the 4-year simulation linearly ramping from 0 K to +2 K forcings, to highlight the range of SST offsets that are further from the fixed-SST training data and hence provide a more rigorous out-of-sample test. The bias reductions seen in the ramping simulation are comparable to those in the target climates; the bias of mean precipitation averaged over all land is reduced over 85%, and the standard deviation of the geographic pattern of time-mean bias is reduced over 75% to values around 0.5 mm/d. A significant portion of the biases in each target climate is explained by differences in precipitation between the validation and training data sets, as shown by the "train" bars. Thus, we anticipate further bias reduction with larger training and validation data sets.
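The two bias metrics reported in Figure 3 can be sketched as follows, following the definitions in that figure's caption: the area-weighted mean bias, and the square root of the area-weighted mean square bias. The function name and the flattened-list data layout are assumptions made for illustration.

```python
import math


def bias_metrics(predicted, reference, area):
    """Return (mean bias, root of area-weighted mean square bias).

    predicted, reference: time-mean precipitation per grid cell (mm/d).
    area: grid-cell areas used as weights (any consistent unit).
    """
    total_area = sum(area)
    bias = [p - r for p, r in zip(predicted, reference)]
    mean_bias = sum(a * b for a, b in zip(area, bias)) / total_area
    mean_square = sum(a * b * b for a, b in zip(area, bias)) / total_area
    return mean_bias, math.sqrt(mean_square)


# Two-cell toy example: opposite-signed errors, unequal cell areas.
toy_mean_bias, toy_pattern_bias = bias_metrics([1.0, 3.0], [2.0, 2.0], [3.0, 1.0])
```

A land-only version of either metric would apply the same calculation to the subset of cells flagged as land.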
Both the 0 K and ramping simulations have smaller precipitation pattern biases than reported for a current-climate case by Arcomano et al. (2023). They reported that their hybrid reservoir computing ML reduced the standard deviation of precipitation pattern bias nearly 50% from 1.2 mm/d in their no-ML baseline model to a value of 0.63 mm/d. Our mean precipitation biases are also much smaller than the bias shown in Figure 1 of Fulton et al. (2023) for the South Asian monsoon region. A direct comparison with François et al. (2021) and Pan et al. (2021) is difficult because they considered France and the continental United States, respectively, both of which have much smaller biases than the rest of the globe in our model.
Figure 4 shows that the CycleGAN-translated data has a 3-hourly probability distribution and a diurnal cycle of land precipitation that much more closely match the C384 reference data across all climates, including the ramping simulation. We might expect the CycleGAN to struggle to represent extreme precipitation events and their sensitivity to climate forcing because they appear infrequently in the training data. Nevertheless, the 58,400 precipitation fields, each containing 13,824 atmospheric columns, comprise almost 10^9 atmospheric columns, perhaps enough to learn how to translate even highly unusual precipitation events. Indeed, the CycleGAN improves the accuracy of the CDF of precipitation up to the 99.999th percentile. Only at the 99.9999th percentile, and only for the 2 K forcing, does the CycleGAN slightly increase the error over that of the input C48 data. Surprisingly, the distribution of ML outputs is, if anything, over-dispersive in the tails. The shape of the diurnal cycle of precipitation over land is also improved across all climate forcings, with a stronger trough and sharper increase in precipitation from 6:00 to 15:00 local solar time, and more sustained precipitation through the 21:00-24:00 bin.
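The sample counts quoted above can be reproduced with simple arithmetic. The sketch assumes 365-day years (no leap days) and the standard C48 layout of six cubed-sphere tiles of 48 x 48 cells; both assumptions are consistent with the totals given in the text.

```python
TRAIN_YEARS = 5           # training years per climate (from the text)
DAYS_PER_YEAR = 365       # assumption: no leap days
SNAPSHOTS_PER_DAY = 24 // 3  # 3-hourly output
N_CLIMATES = 4            # four SST offsets
TILES = 6                 # cubed-sphere faces
TILE_SIZE = 48            # C48: 48 x 48 cells per face

n_snapshots = TRAIN_YEARS * DAYS_PER_YEAR * SNAPSHOTS_PER_DAY * N_CLIMATES
columns_per_snapshot = TILES * TILE_SIZE ** 2
total_columns = n_snapshots * columns_per_snapshot
```

The total of about 8.07 x 10^8 columns means that even the 99.999th percentile is represented by several thousand training samples.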
The CDF correction performed by the CycleGAN is even competitive with QM and with the training data (representing sampling variability) across many percentiles and climates. QM does an excellent job correcting the data distribution despite being fit across all climates, but it fails to correct the mean bias of precipitation over land. This is because the C48 model has only a small bias in global precipitation. The land bias is an error in its regional distribution which QM is not able to correct (see also Figure S2 in Supporting Information S1). QM similarly cannot correct the shape of the diurnal cycle.
While the closest match of the translated CDF occurs in the ramping simulation, there is no reason to expect this from our methodology, and this result may be due to random chance. We do note that the C384 (ML) has an excess of extreme values across all climates. The CDF of the C48 ramping simulation used as input has a lower occurrence of extreme values than either the 0 K or the +2 K simulations, which could have a competing effect resulting in a closer CDF match of the C384 (ML) outputs.
Here, the diurnal cycle over land was computed by determining the local solar time in each land-based grid cell for each sample based on its longitude, and then binning the data across local time before taking an area-weighted mean.
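The binning procedure described above can be sketched as follows. The conversion of longitude to local solar time (15 degrees per hour), the 3-hr bin width, and the data layout are assumptions for illustration; the function names are hypothetical.

```python
def local_solar_hour(utc_hour, lon_deg):
    """Approximate local solar time in hours, from UTC time and longitude."""
    return (utc_hour + lon_deg / 15.0) % 24.0


def diurnal_cycle(samples, bin_hours=3):
    """Area-weighted mean precipitation per local-time bin.

    samples: iterable of (utc_hour, lon_deg, area_weight, precip) for
    land grid cells. Returns one value per bin (None for empty bins).
    """
    n_bins = 24 // bin_hours
    weighted_sum = [0.0] * n_bins
    weight_total = [0.0] * n_bins
    for utc_hour, lon_deg, area, precip in samples:
        hour = local_solar_hour(utc_hour, lon_deg)
        index = int(hour // bin_hours) % n_bins  # guard against rounding to 24.0
        weighted_sum[index] += area * precip
        weight_total[index] += area
    return [s / w if w > 0.0 else None for s, w in zip(weighted_sum, weight_total)]


toy_cycle = diurnal_cycle([
    (0.0, 0.0, 1.0, 2.0),    # local time 00:00, bin 0
    (0.0, 0.0, 3.0, 4.0),    # local time 00:00, bin 0, larger cell
    (0.0, 180.0, 1.0, 1.0),  # local time 12:00, bin 4
])
```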

Sensitivity Studies
This section describes sensitivity studies that help motivate some of our model design choices. We initially trained the CycleGAN model with less data, but found the global maps of time-averaged precipitation vary significantly from year to year, resulting in biases in the trained model as a result of under-sampling the long-term climate. When we train the model using only the first year of data from each climate and evaluate on the same 3 years of validation data, the model has significantly worse time-mean biases (Figure S2 in Supporting Information S1, compare light purple bar to darker blue bar) and does a worse job predicting the output CDFs, overpredicting the extremes of each climate's precipitation distribution (Figure S3 in Supporting Information S1).
Adding spatiotemporal features defining the diurnal cycle as context to the input of the generator and discriminator was crucial for correcting the shape of the mean diurnal cycle of precipitation over land. Without these features, the land diurnal cycle of the C384 (ML) output data is still significantly improved because we have corrected the mean and variance, though the shape of the cycle (light green and light blue) is more similar to the C48 values (orange). Surprisingly, the inclusion of these features has little impact on the standard deviation of the geographically-resolved time-mean bias (Figure S2 in Supporting Information S1).
In Zhu et al. (2017), an identity loss was included for certain translation tasks to avoid unnecessary modification of the color scheme during translation. We find removing this identity loss generally degrades model performance. It leads to increased pattern bias and land-mean bias in precipitation (Figure S2 in Supporting Information S1) and has a neutral effect on the CDF and land diurnal cycle (Figure S3 in Supporting Information S1).
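For readers unfamiliar with the identity loss, the sketch below shows how it sits alongside the cycle-consistency loss in the formulation of Zhu et al. (2017), using L1 distances. The adversarial terms are omitted, the generators are stand-in callables on flat lists, and the weights are hypothetical placeholders rather than the values used in this study.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_and_identity_loss(g_ab, g_ba, real_a, real_b,
                            cycle_weight=10.0, identity_weight=5.0):
    """Non-adversarial part of a CycleGAN generator objective.

    g_ab maps domain A to B (e.g., C48 to C384); g_ba maps B to A.
    The weights are illustrative assumptions.
    """
    # Cycle consistency: translating there and back should recover the input.
    cycle = l1(g_ba(g_ab(real_a)), real_a) + l1(g_ab(g_ba(real_b)), real_b)
    # Identity: a sample already in the output domain should pass unchanged.
    identity = l1(g_ab(real_b), real_b) + l1(g_ba(real_a), real_a)
    return cycle_weight * cycle + identity_weight * identity


def shift_up(values):
    return [v + 1.0 for v in values]


def shift_down(values):
    return [v - 1.0 for v in values]


# Toy generators that are exact inverses but never act as the identity.
toy_loss = cycle_and_identity_loss(shift_up, shift_down, [0.0, 1.0], [2.0, 3.0])
```

In the toy case the cycle term is zero because the two generators invert each other, so the entire loss comes from the identity term penalizing changes to samples that are already in the output domain.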

Discussion
Similarly to Hess et al. (2022), we were able to use cycle-generative architectures to match the PDF of 3-hourly precipitation without QM. Unlike QM, this approach additionally improves the spatial distribution and diurnal cycle of precipitation.
It is worth discussing why multiple studies found GANs unable to correct the PDF of precipitation on their own (François et al., 2021; Fulton et al., 2023; Pan et al., 2021), unlike here and in Hess et al. (2022). Pan et al. (2021) claimed that QM is needed because "GANs are trained to produce individual trust-worthy samples, not accurate probability distribution estimations," due, for example, to mode collapse (Bau et al., 2019). It is true that GANs suffer from mode collapse on image generation tasks (for example, images with chairs are completely unlike images without chairs and belong to a detached output distribution), but the architecture itself is designed to produce accurate PDFs (Goodfellow et al., 2014). Unlike chair images, for any realistic precipitation state there exists another similar realistic precipitation state that is slightly more likely to occur, and in this sense precipitation is not modal. However, this idealized view might not hold for a training data set of insufficient size, especially if multiple variables are involved, due to the curse of dimensionality. Many methodological differences might explain why we were able to better simulate the probability of extreme precipitation events without QM. We correct only precipitation, without using other model output fields as dynamical constraints (Pan et al., 2021) or additional fields to be corrected (François et al., 2021; Fulton et al., 2023). In addition, our model is trained on more data than the previous studies with PDF biases. We used 58,400 timesteps each with 13,824 grid cells, resulting in 807M precipitation samples, while François et al. (2021), Pan et al. (2021), and Fulton et al. (2023) used 7.42, 247, and 40.6M samples, respectively. Hess et al. (2022) achieved excellent PDF correction results with only 107M training samples of daily precipitation, using coarse global data as in our study.
While this CycleGAN significantly improves the climate of individual samples from a spun-up C48 model state, it should not be used to correct weather simulations run at C48 which are initialized from a coarsened C384 state. We trained the CycleGAN only on samples which are far into a C48 simulation, whose climate contains more significant biases than a hypothetical data set containing samples from the first week of a C48 simulation initialized from coarsened C384 data. One could remove this input bias effect by training a CycleGAN model to correct model biases at one particular forecast lead time, and using coarse and fine-grid examples at that particular lead time. One could also train a conditional CycleGAN with forecast lead time as a model input capable of correcting a variety of lead times, similar to what was done in this work for time-of-day.
One limitation of this approach is that it seeks only to reproduce the climate of a reference high-resolution model. Where the goal is to reduce climate biases relative to the true future climate, it can only do so to the extent that the high-resolution model itself has lower biases. As there are presently no observations of the future, the accuracy of climate projections from high-resolution models is a complex research question which lies outside the scope of this study.

Conclusions
We found that CycleGAN with little modification can accurately translate 3-hourly precipitation simulated by a 200 km grid global atmospheric model across a range of climate forcings to have statistics similar to output from a reference fine-grid 25 km model, as measured by its time-mean geographically-resolved pattern, its CDF, and its mean diurnal cycle over land. These biases are much reduced compared to previous online correction approaches, but because CycleGAN is a post-processing approach, this comes at the expense of interpretability.

Figure 1. Inputs and outputs of the CycleGAN for one timestep during the ramping simulation. Precipitation data on the left was used as input to generate precipitation data on the right. This snapshot was selected to illustrate a common feature, significantly stronger precipitation over South America in C384 (replicated by the GAN) than in C48. All snapshots for this simulation can be viewed as a video in McGibbon (2023).

Figure 2. Annual-mean precipitation from C384 reference run (left column) and precipitation biases from the C48 simulation (right column) and from the GAN applied to this C48 simulation (C384 ML). Bias values are differences from the C384 reference.

Figure 3. Metrics of time-average precipitation bias against validation and testing data.Mean bias refers to the area-weighted horizontal mean bias across all samples, or over land samples only.Bias standard deviation refers to the square root of the area-weighted mean square bias, averaged over the horizontal either globally or over land samples only.These statistics are derived from bias maps as shown in Figure 2. "Train" indicates the comparison of the training data itself against the validation data.Training data is not available for the ramping simulation.

Figure 4. CDF metrics and diurnal cycle of precipitation over land.The left column shows the CDF of precipitation for each climate.The center column shows the relative magnitude of errors of the values of the C48 and CycleGAN CDFs in the left column across a range of percentiles, computed as a percentage of the C384 (real) value.The right column shows the diurnal cycle of precipitation over land, with the x-axis indicating the starting local time of the 3-hr bin.
The CycleGAN generalizes well to a ramped-SST simulation with intermediate forcings not present in the training data set. With a small set of expensive fine-grid simulations, the CycleGAN can thus quickly debias precipitation fields predicted by a fast coarse-grid model across a broad range of climates.
Geophysical Research Letters 10.1029/2023GL105131