Learning to Correct Climate Projection Biases

The fidelity of climate projections is often undermined by biases in climate models due to their simplification or misrepresentation of unresolved climate processes. While various bias correction methods have been developed to post-process model outputs to match observations, existing approaches usually focus on limited, low-order statistics, or break either the spatiotemporal consistency of the target variable or its dependency on model-resolved dynamics. We develop a Regularized Adversarial Domain Adaptation (RADA) methodology to overcome these deficiencies and enhance the efficient identification and correction of climate model biases. Instead of pre-assuming the spatiotemporal characteristics of model biases, we apply discriminative neural networks to distinguish historical climate simulation samples from observation samples. The evidence on which the discriminative neural networks base their distinctions is used to train the domain adaptation neural networks that bias-correct the climate simulations. We regularize the domain adaptation neural networks using cycle-consistent statistical and dynamical constraints. An application to daily precipitation projection over the contiguous United States shows that our methodology can correct all the considered moments of daily precipitation at approximately 1° resolution, ensures spatiotemporal consistency and inter-field correlations, and can discriminate between different dynamical conditions. Our methodology offers a powerful tool for disentangling model parameterization biases from their interactions with the chaotic evolution of climate dynamics, opening a novel avenue toward big-data-enhanced climate predictions.

A discriminative function learns arbitrary mismatches between historical climate simulations and observations. Feedback from this discriminative function is applied to learn a bias correction function that closes these mismatches. We alternately train the discriminative function and the bias correction function to improve the skill of both, until the corrected climate simulation is indistinguishable from the observation. This adversarial learning framework offers the potential for leveraging the big data of historical climate observations and simulations to disentangle the parameterization biases from model-resolved dynamics, saving the considerable effort of forcing a GCM to start from realistic initial conditions to expose its parameterization biases.
Although promising, adversarial learning can be highly unconstrained (Wu et al., 2020; Yang et al., 2019). For instance, the bias correction function may map the GCM simulations onto a limited variety of climate variability modes, or produce "corrections" that are inconsistent with the GCM-resolved dynamics. To address these issues, we introduce several statistical and dynamical constraints to regularize the data-driven bias corrector. We show that Requirements 1-3 in Ehret et al. (2012) can potentially be well satisfied using this regularized adversarial domain adaptation (RADA) methodology. To include feedback effects (Requirement 4), one should either couple the bias corrector with dynamical simulations, or iteratively apply this methodology to correct the parameterization biases and alleviate their impacts on model-resolved dynamics. We discuss the difficulties and opportunities here, and leave the tackling of the feedback problem for future work.
The rest of the paper is organized as follows. In Section 2, we introduce the notion of adversarial learning, the challenges of using an unregularized methodology, and the three regularization methods. A case study on correcting daily precipitation projection biases is presented in Section 3, and the results of this case study are described in Section 4. Section 5 discusses the individual and combined contributions of the three regularizers to improving the adversarial learning of GCM biases. Section 6 compares our approach with well-established univariate and multivariate bias correction methods. Section 7 discusses the perceptual performance and the training stopping criteria of the RADA methodology. We draw conclusions and lay out directions for future work in Section 8.

Disentangling Parameterization and Dynamics Using GANs
We distinguish two sets of variables in climate simulation. The first set includes the variables that are directly resolved by discretizing the geophysical fluid dynamical equations, also informally referred to as the dynamics. The second set includes the remaining unresolved variables, also informally referred to as the physics. In climate simulation, the unresolved variables are estimated based on their statistical connections with the resolved variables. These parameterization processes contribute to the majority of climate simulation biases. Our objective is to identify and correct these parameterization biases. The challenge lies in disentangling the intricate parameterization biases from their interactions with the chaotic evolution of climate dynamics (Ehret et al., 2012).

Regularized Adversarial Domain Adaptation
We summarize the final bias correction methodology here. As illustrated in Figure 2a, we apply two discriminative neural networks D and D^-1 to learn arbitrary mismatches between the distributions of the historical climate simulation X and the observation Y. Meanwhile, we adopt dynamics-dependent domain adaptation neural networks G(X, S_X) and G^-1(Y, S_Y) to close the mismatch under cycle-consistency constraints. The overall loss function does not directly indicate the model performance, and the training should be terminated using a separate criterion. A common strategy is to visually inspect the generated results and stop the training if there is no visually perceived improvement. However, as we will show in Section 7, this strategy does not work in our problem, as perceptual judgments can be easily fooled and yield poor quantitative performance during test. Here, we compare some crucial precipitation indices (first- to fourth-order moments, see Section 3.4 for details) between corrected simulations and observations for the validation set at the end of each training epoch, and terminate the training when these statistics match best.
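The epoch-end stopping criterion above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and the summation of the four moment RMSEs into one scalar score are our assumptions.

```python
import numpy as np

def moment_match_score(corrected, observed):
    """Sum of spatial-average RMSEs between the first- to fourth-order
    moment maps (mean, std, skewness, kurtosis) of two (time, lat, lon)
    daily precipitation arrays. Lower is better; 0 means identical moments."""
    def moments(a):
        mu = a.mean(axis=0)
        sd = a.std(axis=0)
        z = (a - mu) / sd  # standardized anomalies per grid cell
        return mu, sd, (z ** 3).mean(axis=0), (z ** 4).mean(axis=0)

    score = 0.0
    for m_c, m_o in zip(moments(corrected), moments(observed)):
        score += float(np.sqrt(np.mean((m_c - m_o) ** 2)))
    return score

# During training: evaluate this score on the validation set after every
# epoch, and keep the checkpoint with the smallest score.
```

A checkpoint-selection loop would simply track the running minimum of this score across epochs.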

General Settings
To test the proposed methodology, we consider a case study of correcting the Community Earth System Model version 2 (CESM2, Danabasoglu et al., 2020) daily precipitation projection over the contiguous United States (CONUS). We consider precipitation because it is among the most important yet least well-simulated variables in GCMs (Marvel & Bonfils, 2013; Tapiador et al., 2019). Later studies may explore the possibility of scaling up this methodology from univariate to multivariate fields, from continental to global scale, and from moderate to high spatiotemporal resolution, by leveraging the scalability of deep neural networks.
We quantify the precipitation estimation accuracy before and after correction using a suite of commonly applied skill indices computed for a holdout test set. The applied data, data processing methods, model configuration, and evaluation approaches are described below.

GCM Simulation
CESM2 is an open-source, fully coupled global climate model that provides state-of-the-art computer simulations of the Earth's past, present, and future climate states (Danabasoglu et al., 2020). We use the CESM2 historical simulation contributing to the Coupled Model Intercomparison Project phase 6 (CMIP6, Eyring et al., 2016). Besides the daily precipitation products, we use the simulated daily averaged sea level pressure (SLP), and the geopotential height and specific humidity at 500 hPa (z_500hPa and q_500hPa), as dynamical constraints to support model training. All the simulation data are of a common resolution of approximately 1°.

Observation
We use the U.S. National Oceanic and Atmospheric Administration (NOAA) Climate Prediction Center (CPC) unified gauge-based analysis of daily precipitation over CONUS. The corresponding SLP, z_500hPa, and q_500hPa fields from the ERA-20C reanalysis (Poli et al., 2016) are used as dynamical constraints for the observations.

Data Processing
The precipitation field data covering CONUS land areas are used. We apply the dynamical field data with an extra margin around CONUS to constrain the precipitation field. The observed precipitation, SLP, z_500hPa, and q_500hPa reanalyses are regridded to the same resolution as the CESM2 simulation using spatial averaging (for the CPC precipitation observation) or bilinear interpolation (for ERA-20C). To accommodate the long-tailed distribution of precipitation, we regularize the precipitation data using a natural logarithm transformation. We normalize the dynamical variables by subtracting the grid mean and dividing by the grid standard deviation. It should be noted that different data processing approaches might influence the model performance. We leave a detailed examination of the impact of data processing for future work.
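The preprocessing steps above can be sketched as follows. This is an illustrative sketch: the function name is ours, and we use log1p to handle zero-precipitation days, since the paper states a natural-log transform without specifying how zeros are treated.

```python
import numpy as np

def preprocess(precip, dyn):
    """Illustrative preprocessing: log-transform precipitation and
    standardize a dynamical field per grid cell.

    precip: (time, lat, lon) daily precipitation (non-negative)
    dyn:    (time, lat, lon) one dynamical variable (e.g., SLP)
    """
    # log1p compresses the long tail while keeping zero days finite;
    # the exact offset used in the paper is not specified.
    precip_t = np.log1p(precip)

    # Standardize per grid cell: subtract the grid mean, divide by the
    # grid standard deviation (as described in the text).
    mu = dyn.mean(axis=0, keepdims=True)
    sd = dyn.std(axis=0, keepdims=True)
    dyn_t = (dyn - mu) / sd
    return precip_t, dyn_t
```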
We split the data into training, validation, and test sets. The training and validation sets cover the period of 1950-1994 (16,435 days): for every 10 years from 1950 to 1994, we put data from the first 8 years into the training set, and the remaining 2 years into the validation set. Data from 1995 to 2004 are held out in the test set to offer an unbiased evaluation of the bias corrector.
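Assuming the per-decade 8/2 train/validation split described above, the year-to-set assignment can be sketched with a hypothetical helper:

```python
def assign_year(year):
    """Hypothetical year-to-set assignment: within 1950-1994, the first
    8 years of each decade go to training and the last 2 to validation;
    1995-2004 is the held-out test set."""
    if 1995 <= year <= 2004:
        return "test"
    if 1950 <= year <= 1994:
        return "train" if (year - 1950) % 10 < 8 else "validation"
    raise ValueError(f"year {year} outside the study period")
```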

Model Configuration
The RADA model architecture is illustrated in Figure 2. We apply convolutional neural networks (CNNs; Krizhevsky et al., 2012; LeCun & Bengio, 1995) as building blocks for the model, given their capability to exploit the multi-scale spatial coherency of geodata (Baño-Medina et al., 2020; Sadeghi et al., 2020).
For the downscaling submodules F*_X and F*_Y, we apply a U-net form of CNN (Figure 2b, Ronneberger et al., 2015). The U-net applies a convolution-based contracting path to capture the resolved dynamical field information, and a symmetric transposed-convolution-based expanding path that gradually refines the precipitation field estimation. Skip connections between symmetrical convolution and transposed-convolution blocks ease the training by forcing the deeper layers to learn meaningful representations that are not well captured by shallower layers (He et al., 2016). Similarly, we apply CNNs with skip connections for the remaining submodules, and train the model for 300 epochs using 36 GPUs on the Wolfram Mathematica 12.1 deep learning platform (Wolfram Research, Inc., n.d.). The training does not require the climate simulation and observation data to be synchronized. Considering the impact of the seasonal cycle, internal climate variability, and climate change, in each iteration of a training epoch, we take a random mini-batch of daily precipitation observations, and sample climate simulations from the same months and within 5 years. This strategy greatly increases the training sample diversity, while excluding the impact of seasonality and climate change. For each training epoch, we shuffle the observational data in the training set and split the data into random mini-batches. For each mini-batch, we sample the corresponding simulation data (same months, within 5 years), obtain the gradient of L_RADA with respect to all the model parameters using this mini-batch of data, and update the model parameters accordingly. We repeat this process for all the mini-batches, completing one epoch of training.
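The unpaired mini-batch sampling strategy above can be sketched as follows. The function name and data layout are our assumptions; the key point is that each observation day is matched with a simulation day from the same calendar month within a 5-year window, not with a synchronized date.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_simulation_batch(obs_dates, sim_dates, window_years=5):
    """For each observation day, draw an (unpaired) simulation day from
    the same calendar month and within `window_years` years.

    obs_dates, sim_dates: sequences of (year, month) pairs.
    Returns a list of indices into sim_dates.
    """
    picks = []
    for year, month in obs_dates:
        candidates = [i for i, (y, m) in enumerate(sim_dates)
                      if m == month and abs(y - year) <= window_years]
        picks.append(int(rng.choice(candidates)))
    return picks
```

Sampling within the month/year window keeps the seasonal cycle and the climate-change signal aligned between the two domains while still decoupling the day-to-day weather.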
We take an ensemble of trained neural networks, each representing a condition where no individual index measuring the precipitation estimation accuracy can be improved without worsening the other indices (a Pareto front). The considered indices are the spatial average root mean square errors between the observation's and simulation's first- to fourth-order moments of the precipitation distribution (see Section 3.4). The average of the four neural network outputs is applied as the final bias correction result.
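The Pareto-front selection described above can be sketched as follows (illustrative; the function name and the brute-force dominance test are ours):

```python
import numpy as np

def pareto_front(index_rmse):
    """Return indices of models on the Pareto front: a model is kept if
    no other model is at least as good on every index and strictly
    better on at least one.

    index_rmse: (n_models, n_indices) array of RMSEs (lower is better).
    """
    x = np.asarray(index_rmse, dtype=float)
    front = []
    for i in range(len(x)):
        dominated = any(
            np.all(x[j] <= x[i]) and np.any(x[j] < x[i])
            for j in range(len(x)) if j != i
        )
        if not dominated:
            front.append(i)
    return front
```

The trained networks on the front are then averaged to produce the final correction.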

Performance Evaluation
The discriminator D can in principle detect any mismatch between the distributions of observations and simulations, given that neural networks are universal function approximators (Leshno et al., 1993). Therefore, we could potentially evaluate the bias correction performance based on any statistics. Here we focus on commonly applied measures of precipitation distribution characteristics, including the moment, quantile, frequency, intensity, and seasonality characteristics at the grid scale, as well as spatial coherency and dynamical consistency at the field scale. The considered indices are listed in Table 1. For grid-scale assessment, we consider the first- to fourth-order moments (mean, standard deviation, skewness, and kurtosis), the 33%, 66%, and 99% quantiles, the probability of precipitation, the average precipitation intensity, the 1-day, 3-day, and 5-day maximum precipitation, and the ratio of winter/summer (December-February/June-August) precipitation. We assess the significance of bias correction improvements using the following bootstrap test: we first generate 100 bootstrap samples of the precipitation observation and simulation time series before and after correction. Thereafter, we calculate the fraction of bootstrap samples for which the target index after bias correction better matches that of the observation. The confidence interval is then determined using a percentile method. All tests are one-tailed and use a confidence level of 99%. We further apply the spatial correlation (r) and spatial average root mean square error (RMSE) between observation and simulation for these indices to roughly summarize the climate simulation performance before and after correction.
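The bootstrap comparison above can be sketched as follows. This is a minimal sketch: the function name, the use of the mean as the example statistic, and the independent resampling of the three series are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def improvement_fraction(obs, raw, corrected, stat=np.mean, n_boot=100):
    """Fraction of bootstrap resamples in which the corrected series'
    statistic is closer to the observed one than the raw series' is.
    Values near 1 indicate a robust improvement."""
    obs, raw, corrected = map(np.asarray, (obs, raw, corrected))
    wins = 0
    for _ in range(n_boot):
        o = rng.choice(obs, size=obs.size, replace=True)
        r = rng.choice(raw, size=raw.size, replace=True)
        c = rng.choice(corrected, size=corrected.size, replace=True)
        if abs(stat(c) - stat(o)) < abs(stat(r) - stat(o)):
            wins += 1
    return wins / n_boot
```

A one-tailed percentile test at the 99% level then checks whether this fraction is significantly above chance.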

Journal of Advances in Modeling Earth Systems
Table 1 (fragment). Field scale, intra-field: the first and second empirical orthogonal functions of the precipitation field (the leading orthonormal eigenvectors of the precipitation correlation matrix).

General Performance
We evaluate the RADA methodology by examining whether the distributions of the precipitation indices better match observations after bias correction. Evaluation results for the test set (1995-2004) are presented in Figures 3-8. Each sub-figure shows the spatial maps of a considered precipitation index for the observations (left), and the CESM2 historical simulations before (middle) and after (right) bias correction.
In general, the RADA methodology demonstrates favorable performance by faithfully representing the spatial variability of the precipitation distribution, improving the spatial correlation, and reducing the spatial average RMSE between observations and simulations for all the considered grid-scale indices (Figures 3-6). We achieve significant skill improvements for a broad range of regions (stippled grids on the bias-corrected maps), while the limited regions showing significant skill drops (stippled grids on the uncorrected maps) for some low-precipitation-related indices are mostly in mountain areas, suggesting the need for dedicated effort in modeling orographic effects in future development of our approach. Regarding the field-scale

PAN ET AL.

10.1029/2021MS002509
assessments, our method better simulates the linear spatial structure of the precipitation field (Figure 7), and maintains the coherency between precipitation and the resolved dynamics (Figure 8).

Moments
For moment-based evaluations (Figure 3), the improvements are reflected in the following four aspects. First, we obtain considerable improvements for the western mountain regions, especially for the Sierra Nevada, where the orographic lifting and shadowing impacts from the narrow mountain strips pose notorious difficulty for GCM simulations (Gershunov et al., 2019; Kapnick et al., 2018). Second, localized high precipitation variability incubated by peculiar geographical conditions can be well reproduced, such as for the Southern California Bight. Third, the bias correction results better represent the intense and highly variable coastal precipitation along the Gulf of Mexico, and alleviate the overestimation along the eastern coast. Finally, the bias correction results better match observations regarding the inland precipitation distribution.

Quantile, Frequency, and Intensity
For quantile-, frequency-, and intensity-based evaluations (Figures 4 and 5), we significantly alleviate the overestimation of low-precipitation events for the Northern Great Plains, the Midwest, and the Southeast (Figure 4a). Also, we significantly draw down the precipitation frequency estimation to better match observations for most regions (Figure 4d). Exceptions are the mountain regions, where our model tends to exacerbate the existing frequency biases.

Another favorable aspect of the RADA approach shown here is that we obtain considerable improvement in simulating precipitation extremes, as reflected in the 99% quantile (Figure 4c), 1-day maximum (Figure 5b), 3-day maximum (Figure 5c), and 5-day maximum (Figure 5d) estimation maps. The improvements are particularly obvious for the West Coast and the East North Central regions. We can largely attribute this improvement to the following two reasons: first, although extreme precipitation events are rare locally, they are ubiquitous as viewed from a broader, that is, continental, spatial scope. By modeling the continental precipitation field using CNNs, we obtain a large sample diversity of extreme events. Second, while most existing climate-related machine learning methods apply a maximum likelihood objective, which tends to produce smoothed estimates that underrepresent extremes, the adversarial objective encourages the corrected fields to match the full precipitation distribution, including its tails.

Seasonality
In terms of seasonality (Figure 6), winter precipitation variability in CONUS is largely dominated by extratropical-cyclone-related storm tracks. The cyclone-related precipitation can be enhanced or suppressed as the atmospheric flows encounter orography. The bias correction results better represent the spatial variability of the winter precipitation ratio for orographically complicated regions, such as the western mountain regions (Figure 6a). Also, we improve the representation of the dry winter climate for the Northern Rockies and Plains (Figure 6a). For summer precipitation (Figure 6b), improvements are reflected in better representing the spatial variability of precipitation for the North American Monsoon in the Southwest, and the coastal landfall of tropical cyclones along the Gulf of Mexico and the southeast coast. Furthermore, we also enhance the summer precipitation ratio to better match observations for the Great Plains.

Spatial Consistency
Regarding the spatial consistency of the precipitation estimation, we draw the first and second leading EOFs of the observed, simulated, and bias-corrected daily precipitation fields (Figure 7). The EOFs represent the spatial loadings of the leading principal components (PCs) that account for most of the precipitation variability. Therefore, these EOFs roughly describe the linear spatial structure of the precipitation field. For EOF1 (Figure 7a), the bias correction result agrees better with observations, with the spatial correlation coefficient improved from 0.85 to 0.93 and the spatial RMSE reduced, suggesting that the leading mode of precipitation spatial variability can be better revealed using our approach. Besides, we lowered the ratio of explained variance of PC1 from 2.22% to 1.51%, which better matches the observation (1.29%). For EOF2 (Figure 7b), while there is a considerable discrepancy among the maps, in particular over the western U.S., the bias correction results still agree better with the observation, especially for the Southeast. The bias-corrected results achieve a spatial correlation of 0.57 regarding the PC2 loading, as compared to 0.15 in the original CESM2 simulation. These results suggest that the RADA methodology can maintain, and even improve, the spatial consistency of the target variable.
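The EOF diagnostics above can be sketched as follows, following the definition of the EOFs as the leading orthonormal eigenvectors of the precipitation correlation matrix (computed here via SVD of the standardized anomaly matrix; the function name is ours):

```python
import numpy as np

def leading_eofs(field, n_modes=2):
    """Leading EOFs of a (time, lat, lon) field.

    Standardizing each grid cell before the SVD makes the loadings the
    eigenvectors of the grid-point correlation (rather than covariance)
    matrix. Returns (eofs, var_ratio): EOF maps of shape
    (n_modes, lat, lon) and the explained-variance ratio of each mode.
    """
    t, ny, nx = field.shape
    z = field.reshape(t, ny * nx).astype(float)
    z = (z - z.mean(axis=0)) / z.std(axis=0)      # standardized anomalies
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    eofs = vt[:n_modes].reshape(n_modes, ny, nx)  # orthonormal loadings
    var_ratio = (s ** 2 / (s ** 2).sum())[:n_modes]
    return eofs, var_ratio
```

Comparing the EOF maps and explained-variance ratios of observed, raw, and corrected fields yields the spatial-correlation and RMSE diagnostics reported in Figure 7.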

Inter-Field Correlation
Considering the inter-field correlation between precipitation and the resolved dynamics, we evaluate how the simulation-based and observation-based statistical models (F*_X and F*_Y) work for the observation and simulation data by calculating the r skill of these two models (Figure 8). While GCM biases are conventionally considered at the single-variable level, the assessments here examine GCM biases at the functional-relationship level.
For the simulation-based downscaling results (Figure 8a), F*_X, which is trained using CESM2 historical simulation data, works particularly well for estimating the CESM2-simulated precipitation using the simulated dynamical field information, achieving a spatial average r skill score of 0.89 during test (Figure 8a, middle). However, this GCM-revealed inter-field relationship works far worse when applied to observations, with the average r skill score dropping to 0.42 (Figure 8a, left). This drop of skill can result from the following two reasons: first, the GCM imposes an overly stringent constraint on the simulated precipitation variability; second, the downscaling relationship revealed by the GCM has non-linear biases.
We dissect the impacts from both sides by examining the observation-based downscaling results (Figure 8b). The impact of the first aspect (over-stringency) is reflected in the fact that F*_Y, trained using historical observational data, cannot pose as stringent a constraint on the observed precipitation (Figure 8b, left) as F*_X does on the simulation data (Figure 8a, middle). In other words, the GCM underestimates the conditional variability of the parameterized variable given the resolved dynamics. The impact of the second aspect (non-linear bias) is reflected in the fact that F*_Y achieves an overall higher r skill score when applied to the simulation data than when estimating the observed precipitation (Figure 8b, left). Meanwhile, the over-stringency problem is not fully resolved by this deterministic bias correction approach. An implication here is that we may need to inject random noise into the bias corrector for a probabilistic estimation of the unresolved precipitation processes, as is explored in stochastic parameterization practices (Berner et al., 2017). We revisit this issue in Section 8.2.1.
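The r skill used throughout this comparison, a grid-wise temporal correlation between a downscaling model's precipitation estimate and a reference series, can be sketched as follows (the function name is ours):

```python
import numpy as np

def r_skill(pred, target):
    """Grid-wise temporal correlation between predicted and target
    (time, lat, lon) precipitation arrays, plus its spatial average."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    p = pred - pred.mean(axis=0)
    q = target - target.mean(axis=0)
    num = (p * q).sum(axis=0)
    den = np.sqrt((p ** 2).sum(axis=0) * (q ** 2).sum(axis=0))
    r_map = num / den
    return r_map, float(np.nanmean(r_map))
```

Evaluating F*_X and F*_Y with this skill on both domains is what exposes the over-stringency and non-linear-bias effects discussed above.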

Ablation Analysis
We have introduced three regularizers to enhance the adversarial learning of GCM biases, namely, cycle consistency (Section 2.2.2), dynamical consistency (Section 2.2.3), and dynamical dependency (Section 2.2.4). Here, we quantify the contribution of each regularizer and their combinations using an ablation analysis: we monitor the change of bias correction performance while systematically removing regularizers from the RADA network architecture. The combinations of these three regularizers constitute seven GAN-based bias corrector baselines.
We show the bias correction results for each of the baseline models in Figures S1-S7 in Supporting Information S1. Each figure there draws the spatial maps of all the considered precipitation indices for observations and simulations before and after bias correction. To compare the models' performances, Figure 9 draws the models' relative spatial correlation skills. For each model and each considered precipitation distribution characteristic, we first calculate the spatial correlation between the observation-revealed and simulation-revealed precipitation indices; thereafter, we re-scale the models' correlation skills to the range [0, 1] by subtracting the minimum correlation skill achieved by the nine considered precipitation simulations (the CESM2 original simulation, the RADA bias correction, and the seven GAN-based bias corrections), and dividing the result by the difference between the maximum and minimum correlation skills achieved by these models. The skills of the CESM2 original simulation (gray dashed) and the RADA-corrected simulation (black solid) are plotted as benchmarks.
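The min-max rescaling described above is a standard per-index normalization across models; a minimal sketch (the function name is ours):

```python
import numpy as np

def relative_skill(correlations):
    """Min-max rescale each index's correlation skills across models to
    [0, 1], so that 1 marks the best model and 0 the worst per index.

    correlations: (n_models, n_indices) spatial-correlation skills;
    in the paper's setting the rows are the nine considered simulations
    (CESM2, RADA, and the seven GAN baselines).
    """
    c = np.asarray(correlations, dtype=float)
    lo = c.min(axis=0, keepdims=True)
    hi = c.max(axis=0, keepdims=True)
    return (c - lo) / (hi - lo)
```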
The key findings are summarized here. Applying an unconstrained GAN for bias correction yields poor performance regarding most of the considered precipitation distribution characteristics (Figure 9a). This is because GANs are trained to produce individual trustworthy samples, not accurate probability distribution estimations (Goodfellow, 2016). The overall performance generally improves as more regularizers are included to constrain the bias corrector (Figures 9b-9g), although the inclusion of a new regularizer might hamper the performance in certain aspects. The RADA bias corrector containing all three regularizers achieves overall the best performance compared to the other GAN-based models. This evidence empirically confirms the arguments made in Section 2.2.1, and justifies the contribution of the proposed regularizers.
The quantile mapping approach neglects the spatial consistency of the target variable in its correction. Regarding this deficiency, many multivariate bias correction methods have been developed, although the robustness and benefits of these multivariate methods are frequently questioned (Maraun & Widmann, 2018; Räty et al., 2018), and remain to be rigorously tested (François et al., 2020; Maraun, 2016). Here we consider the Multivariate Bias Correction with N-dimensional probability density function transform (MBCn; Cannon, 2018; Pitie et al., 2005) approach to correct precipitation projection biases while maintaining the spatial consistency. MBCn is adopted for the following three reasons. First, MBCn is claimed to achieve competitive performance compared to univariate or other multivariate bias correction approaches (François et al., 2020; Cannon, 2018). Second, the computation cost for scaling up MBCn to high-dimensional data is relatively low.

Figure 9. Relative spatial correlation skills for the seven generative adversarial neural network (GAN)-based bias correction models (a-g). The GAN-based baseline models are obtained by removing one or several regularizers from the RADA network architecture. The remaining regularizers for these models are denoted in the legend. For each model and each considered precipitation distribution characteristic, we calculate the spatial correlation between the observation-revealed and simulation-revealed precipitation indices. We re-scale the models' correlation skills to the range [0, 1] using min-max normalization based on the correlation skills achieved by the seven baseline models, the original CESM2 simulation, and the Regularized Adversarial Domain Adaptation (RADA) model. The skills of the CESM2 original simulation (gray dashed) and the RADA-corrected simulation (black) are plotted as benchmarks.

Third, we do not need to pre-assume the parametric distribution form of the considered climate variable.
MBCn maps a multivariate source distribution to a same-dimensional target distribution by iteratively applying the following three steps. First, rotate the multivariate source and target distributions with the same uniformly distributed random orthogonal matrix. Second, apply quantile mapping to match the source distribution to the target distribution along each of the rotated dimensions. Third, apply the inverse rotation matrix to yield the source values for the next iteration. At the end of each iteration, an energy distance function (Rizzo & Székely, 2016) is applied to measure the distance between the corrected source distribution and the target distribution. We consider two strategies for stopping the iterative bias correction process. The first strategy, following the setting in Cannon (2018), is to stop the iteration when the energy distance evaluated on the training set converges; the second strategy, as advocated in the discussion section of Cannon (2018), is to use early stopping (Prechelt, 1998): we stop the iteration when the energy distance evaluated on the validation set stops decreasing. We evaluate the model performance using the test set data. The implementations can be found in the Supporting Information S1. More details can be found in Cannon (2018) and François et al. (2020).
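One MBCn iteration, as described above, can be sketched as follows. This is an illustrative sketch of the algorithm, not the reference implementation: the function names are ours, equal sample sizes are assumed for the empirical quantile mapping, and the energy-distance stopping check is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantile_map(src, tgt):
    """Empirical quantile mapping for equal-length 1-D samples: map the
    sorted values of src onto the sorted values of tgt, rank by rank."""
    out = np.empty_like(src)
    out[np.argsort(src)] = np.sort(tgt)
    return out

def mbcn_step(X, Y):
    """One MBCn iteration on (n, d) source X and target Y:
    random orthogonal rotation, per-dimension quantile mapping,
    then the inverse rotation."""
    d = X.shape[1]
    # QR of a Gaussian matrix gives a random orthogonal rotation.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    Xr, Yr = X @ Q, Y @ Q
    Xr = np.column_stack([quantile_map(Xr[:, j], Yr[:, j])
                          for j in range(d)])
    return Xr @ Q.T  # inverse rotation (Q is orthogonal)
```

Iterating `mbcn_step` progressively aligns the full multivariate distribution of X with that of Y; already after one step the per-dimension marginals match in the rotated basis, so the corrected means coincide with the target means.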
The comparison results are presented in a similar manner as in Section 5. We show the bias correction results for quantile mapping in Figure S8 in Supporting Information S1, and for MBCn without/with early stopping in Figures S9 and S10 in Supporting Information S1. Each figure draws the spatial maps of all the considered precipitation indices for observations and simulations before and after bias correction. Figure 10 draws the models' relative spatial correlation skills (see Section 5 for computation details).
The quantile mapping approach (Figure 10a) obtains good performance for correcting grid-scale precipitation distribution characteristics, slightly surpassing the RADA approach regarding some of the considered indices (e.g., Q_66%, Q_99%, and Int). These are reasonable results, as quantile mapping empirically aligns the simulated precipitation distribution with the observed precipitation distribution for each grid cell. Meanwhile, the RADA approach demonstrates an advantage in better matching observations regarding the higher-order moments (Skew and Kurt), Q_33%, the probability of precipitation, the daily maximum precipitation, the seasonal precipitation ratio, and all the field-scale precipitation characteristics. For grid-scale assessments, the disadvantage of quantile mapping here may be due to the fact that these statistics are sensitive to sampling variance, requiring more samples to build empirical CDFs that characterize them well, while the advantage of RADA is achieved by generalizing from the limited available data using deep neural networks. For field-scale assessments, the disadvantage of quantile mapping is that it does not explicitly take spatiotemporal consistency or inter-field correlation into consideration, while RADA is designed to overcome these deficiencies.

The MBCn approach (Figures 10b and 10c, red) obtains relatively lower skill compared to either the quantile mapping approach or the RADA approach. Even worse, many of the statistical indices, after MBCn bias correction, match observations less well. This degradation of skill is particularly obvious if the early stopping strategy is not adopted (Figure 10b). We provide the following explanations for these phenomena. First, most of the reported implementations of MBCn apply cross-validation evaluation (Cannon, 2018; François et al., 2020), which tends to yield misleadingly promising performance; see Maraun et al. (2017) and Maraun and Widmann (2018) for theoretical and empirical evidence.
As we evaluate the MBCn approach using a hold-out test set, we find the iterative bias correction strategy in MBCn generalizes poorly, compared to either the straightforward, one-step quantile mapping approach or the sophisticated but well-regularized RADA approach. Second, in each iteration, MBCn uses a random orthogonal matrix to partially de-correlate the precipitation field data, and carries out quantile mapping in the rotated basis. This linear treatment may not work well for the highly variable precipitation field data. Overall, the results here suggest the difficulty of extending univariate bias correction to multivariate cases, and highlight the contribution of the RADA methodology.
Finally, it is worth noticing that many recent works have tried to correct climate model biases conditioned on model-resolved dynamics (Addor et al., 2016; Bellprat et al., 2013; Wetterhall et al., 2012), which echoes the idea of applying resolved dynamics to regularize the bias correction in the RADA methodology. However, these existing works often overly simplify the dynamical states on which the bias correction is conditioned. For instance, Manzanas and Gutiérrez (2019) showed that conditioning the bias corrector on large-scale circulation patterns, such as El Niño-Southern Oscillation, enhances a model's seasonal forecasting capability. The RADA methodology suggests a potential direction for correcting climate model biases conditioned on more detailed dynamical state information.

Perceptual Evaluation
The RADA methodology, derived from the idea of adversarial learning, builds an evolving game between the neural network parameterized bias detector and bias corrector under cycle-consistent statistical and dynamical constraints. The overall loss function (E_RADA, Equation 5) does not directly inform the model performance or indicate when to stop the training. In many GAN applications, the training is stopped when there is no perceptual improvement in the generated samples (Salimans et al., 2016). Here, we evaluate the perceptual performance of the RADA methodology. We empirically show that, compared to the judgment made by the neural network parameterized bias detector, the perceptual judgment made by human beings can be easily fooled in our problem setting, yielding less optimal bias correction performance.
We consider a binary distinction between climate observation samples and simulation samples, judged by a human test taker and by a separately trained deep neural network. This problem formulation is similar to the climatic Turing test proposed by Palmer (2016). The deep neural network applied here has the same architecture as the RADA discriminative neural network, but, unlike the setting in Wasserstein GAN, its weights are not clipped, so as to fully exploit its discrimination power. Both the human and the deep neural network are trained to distinguish historical climate observation samples from simulation samples, and are tested on a hold-out test set. We compare the human and deep neural network performance on the distinction task before and after RADA bias correction. Results are shown in Figure 11.
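The distinction task itself can be illustrated with a deliberately crude stand-in: a logistic "discriminator" fit on a single hand-picked feature (the per-map spatial standard deviation), with synthetic smoothed fields playing the role of the drizzly simulation. Everything here (field sizes, Gamma parameters, the smoothing kernel) is invented for illustration and is far simpler than the convolutional discriminator used in RADA:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_maps(n, smooth):
    """Toy 16x16 'precipitation maps'; smoothing mimics the low spatial
    variability and widespread drizzle of the biased simulation."""
    x = rng.gamma(0.4, 6.0, size=(n, 16, 16))
    if smooth:
        x = (x + np.roll(x, 1, axis=1) + np.roll(x, 1, axis=2)) / 3.0
    return x

obs_maps, sim_maps = make_maps(400, False), make_maps(400, True)

# single discriminative feature: spatial standard deviation of each map
feat = lambda maps: maps.std(axis=(1, 2))
x = np.concatenate([feat(obs_maps), feat(sim_maps)])
x = (x - x.mean()) / x.std()                       # standardize the feature
y = np.concatenate([np.ones(400), np.zeros(400)])  # 1 = observation

# logistic discriminator fit by plain gradient descent
w = b = 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    w -= 0.5 * ((p - y) * x).mean()
    b -= 0.5 * (p - y).mean()
accuracy = (((1.0 / (1.0 + np.exp(-(w * x + b)))) > 0.5) == y).mean()
```

Even this one-feature classifier separates the two domains well, because the biased "simulation" has systematically lower spatial variability; the actual discriminative network learns such evidence rather than having it hand-specified.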
The top of Figure 11 shows randomly selected samples of precipitation observation maps (first row) and the CESM2 precipitation simulation maps before (second row) and after (third row) bias correction. The simulations do not match the observations, as the freely evolving historical climate simulations do not follow the evolution of historical climate variability. Although unpaired, the observations (first row) and the original simulations (second row) are still easy to distinguish: the observations show higher spatial variability and a well-pronounced orographic imprint; the simulations show lower spatial variability and widespread drizzle. The RADA bias correction results (third row) largely overcome these perceptual deficiencies, while still matching the spatial distribution pattern of the original simulations.
Journal of Advances in Modeling Earth Systems
PAN ET AL.
10.1029/2021MS002509

The bottom of Figure 11 summarizes the performance of the human test taker (top) and the deep neural network (bottom) for distinguishing observation samples from simulation samples before (left) and after (right) bias correction. Here, TP, FP, TN, and FN denote the true positive, false positive, true negative, and false negative rates, respectively. Before bias correction, both the human test taker and the deep neural network distinguish the observation samples from the simulation samples well. While the human test taker misclassifies 5% of the samples, the accuracy of the deep neural network is nearly perfect. After bias correction, the human test taker assigns 67.92% of simulation samples to observations, while the deep neural network retains a large advantage in the distinction task. These results suggest, first, that the perceptual judgment made by human beings is not as accurate as the judgment made by the discriminative deep neural network. In the current problem setting and many other physical applications, we should not rely purely upon perceptual judgment to evaluate model performance or to determine the training stopping criteria. Second, since a separately trained deep neural network can still distinguish the observation samples from the simulation samples with moderate accuracy, the current bias correction model may not have fully exploited the bias information in the data. We discuss the limitations of the model and suggest potential directions for improvement in Section 8.2. Last, while the climatic Turing test (Palmer, 2016) was originally proposed as a subjective evaluation of climate models, our methodology shows that this Turing test can serve as a powerful tool for not only model diagnosis but also model improvement. Eventually, this methodology may potentially allow us to build a digital twin (Bauer et al., 2021) of the climate system.
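The four rates reported in Figure 11 can be stated precisely. Treating "observation" as the positive class, a small helper (hypothetical, for illustration only) computes them from label vectors:

```python
import numpy as np

def confusion_rates(y_true, y_pred):
    """TP/FP/TN/FN rates for the binary observation-vs-simulation task,
    with 1 = observation (positive class) and 0 = simulation."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos, neg = y_true == 1, y_true == 0
    return {
        "TP": (y_pred[pos] == 1).mean(),  # observations recognized as such
        "FN": (y_pred[pos] == 0).mean(),  # observations mistaken for simulations
        "TN": (y_pred[neg] == 0).mean(),  # simulations recognized as such
        "FP": (y_pred[neg] == 1).mean(),  # simulations mistaken for observations
    }

rates = confusion_rates([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```

In this notation, a bias corrector that fools the test taker drives FP upward: after correction, the human's FP rate reaches 67.92%, while the neural network's FP rate stays low.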

Contributions
The accuracy of climate projection is continuously marred by biases in climate models, due to their simplification or misrepresentation of unresolved climate processes. The community has reached a consensus that, to significantly improve climate projection accuracy in the next-generation climate models, we should leverage the information hidden in multi-resolution simulations and multi-source observations using the powerful tools of machine learning (Schneider et al., 2017). Current endeavors along this direction mainly focus on learning data-driven parameterization schemes from high-resolution simulations, either deterministically (Yuval & O'Gorman, 2020) or probabilistically (Gagne et al., 2020; Schneider et al., 2020).
Despite these efforts, climate models equipped with enhanced parameterization schemes (either conceptually based or machine learning based) may still disagree with observations in many aspects, reflecting the gap between parameterization development and application, and highlighting the sophisticated structure of climate model biases. These biases cannot be readily revealed by contrasting historical climate simulation and observation time series, as the freely evolving historical climate simulations do not follow the evolution of actually observed historical climate variability. To identify and alleviate these biases, we may either carry out costly process-oriented diagnosis, or adjust the model output distribution toward observations regardless of physical justification. It remains challenging to efficiently apply the simulation-observation "disagreement" information to diagnose and improve climate models.
In this study, we develop the regularized adversarial domain adaptation (RADA) methodology, which learns to identify and correct climate projection biases using unpaired climate simulation and observation data. We are inspired by the fact that climate experts can easily distinguish maps of observations from maps of model simulations, even though they represent different climate states (Palmer, 2016). As we apply a deep neural network to this observation-simulation distinction task, the evidence on which this network bases its judgment roughly tells where the model biases lie, and can be inversely applied to train a domain adaptation network that makes simulations less distinguishable from observations. Eventually, this may lead us toward the ideal of a digital twin (Bauer et al., 2021) for developing climate models. The adversarial learning process can be highly unconstrained, yielding superficially authentic yet physically unreasonable results. We therefore turn the three requirements that Ehret et al. (2012) conceived for a "perfect" bias corrector into regularizers that support the adversarial learning of climate model biases. Specifically, we include cycle consistency regularization to encourage the model to produce simulations identical to observations under the same resolved dynamical states; we include dynamical consistency regularization to ensure spatiotemporal consistency as well as inter-field correlations in bias correction; and we make the bias corrector dynamically dependent, so that it discriminates between different dynamical conditions. We apply this RADA methodology to correcting the Community Earth System Model version 2 daily precipitation projection over the contiguous United States. Results show that our methodology can correct all the considered moments of daily precipitation at approximately 1° resolution, ensure spatiotemporal consistency and inter-field correlations, and discriminate between different dynamical states.
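The cycle consistency idea admits a one-dimensional caricature. Below, affine maps stand in for the correction network G (simulation to observation) and its counterpart F (observation to simulation), and simple moment matching replaces the adversarial term; all distributions and parameters are invented for illustration. The point is only that a well-fit pair satisfies the cycle constraint F(G(x)) ≈ x while G aligns the simulated distribution with the observed one:

```python
import numpy as np

rng = np.random.default_rng(2)
obs = rng.normal(3.0, 2.0, 4000)   # "observation" domain samples
sim = rng.normal(1.0, 1.0, 4000)   # biased "simulation" domain samples

def fit_affine(src, dst):
    """Affine map matching the first two moments of dst (a stand-in for
    the adversarial distribution-matching objective)."""
    w = dst.std() / src.std()
    return w, dst.mean() - w * src.mean()

wg, bg = fit_affine(sim, obs)      # G: simulation -> observation domain
wf, bf = fit_affine(obs, sim)      # F: observation -> simulation domain
G = lambda x: wg * x + bg
F = lambda x: wf * x + bf

# cycle-consistency penalty: F(G(sim)) should recover the original samples
cycle_loss = np.abs(F(G(sim)) - sim).mean()
```

In RADA the analogous penalty regularizes two deep networks rather than two affine maps, discouraging corrections that match the observed distribution while scrambling the individual fields.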
We conduct an ablation study to justify the contribution of the introduced regularizers. Compared to existing univariate and multivariate bias correction approaches, our methodology corrects all the considered statistics well at the grid scale, and performs better in maintaining intra-field and inter-field consistency.
The key message we want to deliver is that we should learn from the big data of climate observations and simulations to criticize and improve the ever more complicated climate models. Logically, sophisticated climate models should come with equally sophisticated diagnosis tools (Hofstadter, 1979), which could be built upon our methodology. We summarize three of our key contributions as follows:
1. We develop a powerful and efficient paradigm for learning climate projection biases using unpaired climate simulation and observation data.
2. We introduce useful regularization approaches to support physically consistent adversarial learning of climate projection biases. New regularizers can easily be included following the strategies developed here.
3. We describe comprehensive training and evaluation details for implementing and assessing the proposed data-driven bias correction methodology, demonstrating the advantage of our approach over existing univariate and multivariate bias correction methods.


Limitations and Future Work
We have developed an adversarial learning based, data-driven climate model bias corrector (RADA), equipped with three regularizers targeting the three requirements that Ehret et al. (2012) conceived for a "perfect" bias correction algorithm. Here, we highlight four limitations of the current RADA methodology and discuss directions for future work.

Scalability
The current RADA methodology is applied to correct daily precipitation projections at 1° resolution. To make this methodology applicable to GCMs with increasingly high resolutions, we should address the following two problems: first, how to scale the learning algorithm to high spatiotemporal resolution cases; second, how to include a stochastic component to account for the inherent uncertainty in parameterizing the unresolved processes (Berner et al., 2017). Below, we discuss the challenges of and possible solutions to these two problems.
Since the development of the first GAN model (Goodfellow et al., 2014), many works have explored the possibility of scaling adversarial learning to large models and datasets, achieving consistent progress regarding training stability, computational efficiency, and output quality (Karras et al., 2017, 2019; Kurakin et al., 2016; Radford et al., 2015). In particular, Karras et al. (2017) showed that, by starting from a low resolution and progressively adding new layers to refine the output details, we can achieve robust, high-resolution results in adversarial learning. In our future work, we plan to apply this idea to extend the RADA methodology from moderate to high resolution.
While it is relatively straightforward to follow existing works in scaling the learning machine to high-resolution cases, we should note a physical limitation: as GCM resolution increases, there is no clear scale separation between resolved dynamics and parameterized physics (Gerard, 2007). In this case, the resolved dynamics cannot support a strict one-to-one mapping between observations and simulations. This partially explains why the cycle consistency regularization (Section 2.2.2) does not significantly enhance bias correction performance (Figure 9b). To address this problem, we should include a stochastic component to account for the inherent uncertainty in parameterizing the unresolved processes (Berner et al., 2017). A similar issue has been studied in the machine learning literature (Almahairi et al., 2018; Park et al., 2020). Instead of learning strict one-to-one mappings between domains, these methods are designed to learn probabilistic, many-to-many mappings. In particular, by adding auxiliary latent variables to the cycle-consistency GAN model, Almahairi et al. (2018) showed that we can learn mappings that capture the diversity of the outputs in domain adaptation. In our future work, we plan to explore a similar idea to learn stochastic bias corrections for high-dimensional cases.
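A minimal sketch of the latent-variable idea: make the corrector a function of the input and an auxiliary latent draw, so that repeated corrections of the same simulated field differ stochastically, yielding a one-to-many mapping instead of a deterministic one. The affine form and all numbers below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def stochastic_correct(x, sigma=0.5):
    """One-to-many corrector: the output depends on the input AND an
    auxiliary latent draw, so repeated corrections of the same field differ."""
    z = rng.normal(size=np.shape(x))   # auxiliary latent variable
    return 1.8 * x + 0.7 + sigma * z   # deterministic part + stochastic spread

x = np.full(1000, 2.0)                 # the same simulated value, repeated
samples = stochastic_correct(x)        # an ensemble of plausible corrections
```

A learned version would infer both the deterministic map and the latent-conditioned spread from data, as in the augmented cycle-consistency models cited above.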

Explainability
A unique feature of the RADA methodology is that we do not pre-assume where the GCM biases lie; instead, we apply universal function approximators (deep neural nets; Leshno et al., 1993) to identify (via the discriminator D) and correct (via the corrector G) these biases. Although this setting may potentially allow us to identify and correct arbitrary GCM biases, the spatiotemporal characteristics of the GCM biases and their dependency on the GCM resolved dynamics remain vague. To apply this modeling framework in climate studies, the neural network should provide human-understandable justifications for its outputs and help model developers diagnose and improve GCMs. Several neural network interpretability techniques, including saliency maps (Simonyan et al., 2013), occlusion sensitivity analysis (Zeiler & Fergus, 2014), backward optimization (Olah et al., 2017), and layerwise relevance propagation (Bach et al., 2015; Toms et al., 2020), could be applied to verify whether the RADA methodology identifies meaningful model biases, enhance our understanding of model deficiencies, and support model improvement.
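Of these techniques, occlusion sensitivity is the simplest to sketch: slide a masking patch over the input map and record how much the detector's score drops, so the regions whose occlusion changes the verdict most are the regions the detector relies on. The toy score function below (total rain in the upper-left corner of the map) is invented for illustration:

```python
import numpy as np

def occlusion_sensitivity(score_fn, x, patch=4, fill=0.0):
    """Occlusion sensitivity (Zeiler & Fergus, 2014): slide a masking patch
    over the input and record how much the detector's score drops."""
    base = score_fn(x)
    h, w = x.shape
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = x.copy()
            occluded[i:i + patch, j:j + patch] = fill
            heat[i // patch, j // patch] = base - score_fn(occluded)
    return heat

# toy "bias detector": scores a map by the total rain in its upper-left corner
score = lambda m: m[:4, :4].sum()
heat = occlusion_sensitivity(score, np.ones((8, 8)))
```

Applied to the trained RADA discriminator, the same sweep would highlight which parts of a precipitation map expose it as a simulation, turning the black-box verdict into a spatial bias map.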

Applicability in a Changing Climate
We have trained the RADA model using historical climate simulation and observation data from 1950 to 1994, and tested the bias correction performance on a hold-out test set covering 1995-2004. The results suggest that the RADA methodology applies well to a "future," changing scenario. However, this evaluation does not guarantee that a bias corrector trained on historical data can extrapolate well to a severely changed climate, such as climate projections forced by high greenhouse gas emission scenarios. On the one hand, we should strictly examine the bias correction performance for different climate periods; on the other hand, we should apply the detailed bias information offered by the RADA methodology to diagnose and improve the GCMs, translating the black-box model-diagnosed bias information into robust knowledge encoded in climate models. The following section offers a potential direction toward this goal.

Feedback Effects
The three regularizers in the RADA methodology target the three requirements that Ehret et al. (2012) conceived for a "perfect" bias correction algorithm. So far, we have not explicitly considered the feedback effect associated with GCM parameterization biases, which constitutes the last requirement from Ehret et al. (2012), as well as the most difficult challenge in correcting climate projection biases.
There are potentially two directions for tackling the feedback effects from a bias correction perspective: one is to couple the bias corrector with dynamical simulations; the other is to iteratively correct the parameterization biases and alleviate their impacts on model resolved dynamics.
The latter direction is more feasible, as it poses a lower requirement on the spatiotemporal resolution of the bias corrector and a lighter burden on model development. Provided that the bias corrector can pinpoint detailed spatiotemporal bias information without costly initialization and long-term running of climate models, we could obtain sufficient guidance for calibrating and improving climate models (Anderson & Lucas, 2018; Gong et al., 2016; Hourdin et al., 2017; Rasp et al., 2018; Yuval & O'Gorman, 2020). By iteratively training and applying this bias corrector to overcome the parameterization biases, we may gradually alleviate their impacts on model dynamics, thereby resolving the feedback impacts of parameterization biases.
To support the blueprint drawn above, we should test whether the bias corrector can offer detailed spatiotemporal bias information. A perfect test bed is weather forecast postprocessing. Historically, postprocessing in weather forecasting and bias correction in climate projection have been treated as two separate disciplines (Maraun, 2016), although they address similar model biases.
The barrier that prevents us from carrying out a weather forecast postprocessing verification of our methodology is that retrospective weather forecast experiments usually apply different GCM settings than climate projections, regarding the resolution, numerical solver, and parameterization choices. The community has realized the necessity of applying an integrated forecasting system across weather to seasonal scales. Here, we advocate that this unification should extend to the climate scale for better identification and correction of forecasting model biases. Before this is carried out, in future work we plan to test whether the learned bias corrector can enhance weather to seasonal forecasts in a semi-supervised learning manner, alleviating the sample limitation problem in learning forecasting biases.