On the limitations of deep learning for statistical downscaling of climate change projections: The transferability and the extrapolation issues

Convolutional neural networks (CNNs) have become one of the state-of-the-art techniques for downscaling climate projections. They are being applied under Perfect-Prognosis (trained in a historical period with observations) and hybrid approaches (as Regional Climate Model (RCM) emulators), with satisfactory results. Nevertheless, two important aspects have not, to our knowledge, been properly assessed yet: (1) their performance as emulators when driven by Earth System Models (ESMs) different from the one used for training, and (2) their performance under extrapolation, that is, when applied outside of their calibration range. In this study, we use UNET, a popular CNN, to assess these two aspects through two pseudo-reality experiments, and we compare it with simpler emulators: an interpolation and a linear regression. The RCA4 regional model, with 0.11° resolution over a complex domain centered on the Pyrenees and driven by the CNRM-CM5 global model, is used to train the emulators. Two frameworks are followed for the training: predictors are taken (1) from the upscaled RCM and (2) from the ESM. In both frameworks, the performance of UNET when applied to ESMs different from the one used for training is considerably worse, indicating poor generalization. A similar deterioration is seen for the linear method, so this limitation does not seem method-specific but inherent to the task. In the second experiment, the emulators are trained in the present and evaluated in the future, under extrapolation. While averaged aspects such as mean values are well simulated in the future, significant biases (up to 5°C) appear when assessing warm extremes, and the biases produced by UNET are larger than those produced by the linear method. This limitation suggests that, for variables such as temperature, with a marked signal of change and a strong linear relationship with the predictors, simple linear methods might be more appropriate than sophisticated deep learning techniques.

KEYWORDS: convolutional neural networks, deep learning, emulators, EURO-CORDEX, evaluation, extrapolation, pseudo-reality, regional climate models, statistical downscaling

| INTRODUCTION
There is a growing need for high-resolution climate change projections for impact and adaptation studies. This need is usually met by increasing the resolution of global simulations with some sort of downscaling. Two main approaches are possible, dynamical and statistical downscaling (SD), and they have been widely reviewed (Benestad et al., 2008; Charles et al., 2004; Huth et al., 2015; Jacob et al., 2020; Maraun et al., 2010; Rummukainen, 2010; Trzaska & Schnarr, 2014; Wilby et al., 2004; Wilby & Wigley, 1997; Zorita & von Storch, 1999). Dynamical downscaling usually consists of nesting a high-resolution model, such as a Regional Climate Model (RCM), within a lower-resolution model, such as an Earth System Model (ESM), while SD is based on the existence of statistical relationships between large-scale variables (predictors) and local weather (predictands). Since dynamical downscaling is based on physical laws, one of its advantages is the physical consistency among downscaled variables. Nonetheless, its computational expense makes it difficult to use this strategy for the generation of large ensembles (Trzaska & Schnarr, 2014). On the other hand, SD is less computationally expensive, allowing exploration of uncertainties through the generation of large ensembles, but it has two major drawbacks: the need for historical observations and the stationarity assumption it relies on. SD assumes that the predictor/predictand relationships are maintained under future climate change, which is not guaranteed and cannot be directly tested due to the lack of observations for the future (Charles et al., 2004; Trzaska & Schnarr, 2014; Wilby et al., 2004).
Several strategies have been proposed, though, to indirectly assess the transferability of SD methods to different climates. One possible approach is to train methods on the coldest/wettest years of a historical record and then evaluate them over the warmest/driest years, for temperature/precipitation respectively (see Gutiérrez et al., 2013; Olmo & Bettolli, 2022; San-Martín et al., 2017). This approach is limited to the observed variability, though. Another approach is to downscale future simulations and study the impact of downscaling on the long-term trends; ideally, downscaling techniques should preserve ESM trends at the large scale (see Baño-Medina et al., 2021; Hernanz et al., 2023; Vandal et al., 2019; Xu et al., 2020). This approach is limited to the analysis of the spatial scales at which ESMs operate (coarse resolution), and might hide imperfections at the finer spatial and temporal scales. A third approach is to use pseudo-observations (RCM outputs) to train and test statistical methods (in the present and future, respectively; see Charles et al., 1999; Gaitan et al., 2014; Hernanz et al., 2022a). This strategy makes it possible to detect errors at the finer scales and to explore a wider range of climate change than the first approach, but the use of pseudo-observations instead of actual observations introduces an additional source of uncertainty.
Recently, a new hybrid approach, RCM emulators (Doury et al., 2023), has been proposed, combining the advantages of both dynamical and statistical downscaling. This strategy makes use of statistical methods to emulate the behavior of an RCM. Thus, while traditionally SD methods are trained with observations in a historical period, emulators are trained using the RCM outputs as predictands, so their training is not restricted to the historical climate. This approach presents several advantages over traditional SD and RCMs: (1) the use of future simulations for the training enables emulators to be trained with a wider range of climate states than SD under Perfect Prognosis (where the training is done in a historical period with observations), thus avoiding potential problems arising from the stationarity assumption; and (2) with the use of emulators, large ensembles can be produced from a reduced set of RCM simulations: a historical scenario and a high-end emission scenario can be used to train an emulator, which can then produce intermediate future scenarios at low computational expense. On the other hand, the main disadvantage of this approach is that RCM biases are maintained and must be adjusted or taken into account.
Deep learning (DL; see LeCun et al., 2015; Schmidhuber, 2015, for an overview) is a growing field with many applications, including climate downscaling. Convolutional Neural Networks (CNNs) can deal with large amounts of data and present an important advantage over other statistical methods: their ability to extract high-level spatial features automatically (LeCun et al., 1998; LeCun & Bengio, 1995). CNNs have become one of the main state-of-the-art downscaling techniques, both as Perfect Prognosis SD methods and as hybrid approaches (see Baño-Medina et al., 2020, 2021; Höhlein et al., 2020; Liu et al., 2023; Passarella et al., 2022; Serifi et al., 2021; Vandal et al., 2017, 2019). The particular implementation UNET (Ronneberger et al., 2015) has been widely used for image recognition with great performance, and different variations have also been satisfactorily applied to climate downscaling (Doury et al., 2023; Sha et al., 2020a, 2020b; Sharma & Mitra, 2022). Doury et al. (2023) proposed an emulator based on UNET to reproduce an RCM's behavior. First, the emulator was trained with an RCM nested in an ESM, both in a historical scenario and in the Representative Concentration Pathway (RCP) 8.5 (see IPCC, 2013), and then it was evaluated over an intermediate scenario (RCP4.5), driven by the same ESM, with good results. Wang et al. (2021) used a similar emulator for precipitation, also with satisfactory results. Nevertheless, since the purpose of emulators is to produce large ensembles (multiple scenarios and models) at low computational cost, their performance over ESMs different from the one used for calibration should be assessed.
Additionally, to our knowledge, the extrapolation capability of CNNs has not yet been assessed using pseudo-observations, so that the finer scales can be analyzed. Hsieh (2009) and Hernanz et al. (2022b) pointed to important potential problems of neural networks and other machine learning algorithms when they are applied beyond their calibration range.
In this study, we expand the emulator proposed by Doury et al. (2023) to assess its performance when driven by ESMs different from the one used for training, and we also analyze the behavior of the deep learning tool UNET under extrapolation. The document is organized as follows. First, in Section 2, a description of the datasets, experiments, methods and evaluation metrics is provided. In Section 3, evaluation results are shown and commented on. Finally, discussion and conclusions are presented in Section 4.

| METHODOLOGY
In the following subsections, a description of the datasets used, the experimental design, the emulator architecture and the evaluation metrics is provided.

| Data and experimental design
This study focuses on the downscaling of surface daily mean temperature over a small but complex domain centered over the Pyrenees, including part of the Mediterranean and Atlantic coasts of Spain and France, and the Balearic Islands (see Figure 1). The predictand consists of 2345 land grid points, from a 64 × 64 RCM grid-point domain, over a rotated grid of 0.11°. Predictors cover a larger region (55.5° N–33° N, 9° W–13.5° E) corresponding to a 16 × 16 ESM grid-point domain with a resolution of 1.5° (all ESMs are interpolated to the same grid using a bilinear interpolation). For this study, the following datasets have been used (see Table 1).
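As a minimal illustration of this regridding step (not the authors' code; the grid sizes and the linear toy field below are hypothetical), bilinear interpolation onto a common 16 × 16 grid can be sketched with separable 1-D linear interpolation:

```python
import numpy as np

def regrid_bilinear(field, src_lat, src_lon, dst_lat, dst_lon):
    """Bilinearly interpolate a 2-D (lat, lon) field onto a target regular grid."""
    # Interpolate along longitude for every source latitude row...
    tmp = np.array([np.interp(dst_lon, src_lon, row) for row in field])
    # ...then along latitude for every target longitude column.
    return np.array([np.interp(dst_lat, src_lat, tmp[:, j])
                     for j in range(tmp.shape[1])]).T

# Toy example: a field linear in lat/lon on a hypothetical 24 x 24 source grid.
src_lat = np.linspace(33.0, 55.5, 24)
src_lon = np.linspace(-9.0, 13.5, 24)
field = 0.5 * src_lat[:, None] + 0.1 * src_lon[None, :]
dst_lat = np.linspace(33.0, 55.5, 16)  # 16 x 16 common grid, as in the study
dst_lon = np.linspace(-9.0, 13.5, 16)
regridded = regrid_bilinear(field, src_lat, src_lon, dst_lat, dst_lon)
```

Since bilinear interpolation is exact for fields that are linear in latitude and longitude, the toy field is reproduced exactly on the target grid; real ESM fields are, of course, only approximated.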
The first experiment tests the generalization of the emulator to different ESMs. The emulator is trained with RCA4 driven by CNRM-CM5 under RCP8.5, and then applied and evaluated for the four ESMs under the intermediate scenario RCP4.5 (2006–2100). The reason for this choice is that the final purpose of RCM emulators is to replace some RCM simulations by the emulator; the best strategy for this is to produce an extreme scenario with the RCM and then generate intermediate scenarios with the emulator (thus avoiding potential problems related to extrapolation). The relationship between large-scale and local variables for an RCM is stronger if the large-scale variables are taken from the RCM itself and not from the driving ESM (see Doury et al., 2023). Thus, an additional set of predictors is used in some cases, consisting of the RCM upscaled to the coarse resolution (1.5°) using a conservative interpolation, which is referred to as the Upscaled Regional Climate Model (UPRCM). Two evaluation frameworks are explored (see Doury et al., 2023): (1) the Perfect Model Framework, in which predictors for training are taken from the UPRCM, and (2) the Model World Framework, in which they are taken from the driving ESM instead. In this study, we have applied and evaluated the emulator for the four ESMs plus the UPRCM, following both training frameworks. It should be noted that evaluating the emulator for the UPRCM does not represent a realistic practical case, but a theoretical optimum benchmark to compare with.
The second experiment assesses the extrapolation capability of the emulator. Here, the emulator is trained under the intermediate scenario RCP4.5 and only in the present (2006–2025), and evaluated under the extreme scenario RCP8.5 in the future (2081–2100). For this experiment only the UPRCM has been used (both for training and testing), aiming for the best possible conditions for the emulator. It should be noted that the extrapolation issue is not as crucial for emulators as it is for the Perfect Prognosis approach: since emulators are trained not only with historical data but also with future projections, the range of the training dataset is wider and extrapolation problems are not significant. Nevertheless, this experiment aims at highlighting the extrapolation problems that UNET can suffer from when applied under a Perfect Prognosis approach. The reason for assessing potential weaknesses of the Perfect Prognosis approach using the hybrid approach is that it makes it possible to evaluate statistical methods under a larger extrapolation. The variables used as predictors are: temperature, zonal wind and meridional wind at 850 hPa and 500 hPa, geopotential height at 500 hPa, specific humidity at 850 hPa, and mean sea level pressure. The choice of predictors has been conditioned by availability, prioritizing predictors commonly used in statistical downscaling and avoiding variables strongly dependent on the model parameterizations, such as cloud cover or radiation. Predictors are standardized using their mean and standard deviation for the reference period (2006–2035) over the emission scenario used for training. Doury et al. (2023) proposed applying a smoothing filter to the predictors (averaging values over 3 × 3 grid boxes) prior to the standardization and the downscaling. This procedure was based on Klaver et al. (2020), who concluded that the effective resolution of ESMs is often coarser (by about a factor of three) than their nominal resolution. Such a preprocessing has been tested here, but similar results were reached (not shown).
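A minimal sketch of this predictor preprocessing (an assumed illustration, not the actual pipeline; array shapes are toy values) could combine the 3 × 3 box smoothing with standardization against a reference period:

```python
import numpy as np

def smooth_3x3(field):
    """Average each grid box with its 3 x 3 neighborhood (edges reuse border cells)."""
    padded = np.pad(field, 1, mode="edge")
    out = np.zeros_like(field, dtype=float)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            out += padded[1 + di:1 + di + field.shape[0],
                          1 + dj:1 + dj + field.shape[1]]
    return out / 9.0

def standardize(data, ref):
    """Standardize with the mean/std of a reference period (e.g., 2006-2035)."""
    return (data - ref.mean(axis=0)) / ref.std(axis=0)

# Hypothetical daily predictor fields with shape (days, lat, lon).
rng = np.random.default_rng(0)
ref = rng.normal(15.0, 5.0, size=(100, 16, 16))
smoothed = np.stack([smooth_3x3(day) for day in ref])
anom = standardize(smoothed, smoothed)  # standardized predictors
```

Standardizing against the training-period statistics (rather than each period's own) is what exposes the emulator to out-of-range values when the climate warms, which is the situation probed by the second experiment.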

| Emulator description
The emulator used here is the one proposed by Doury et al. (2023), which in turn is an adaptation of the original UNET (Ronneberger et al., 2015). UNET is a popular architecture originally designed for biomedical purposes. It is a fully convolutional neural network, consisting of a series of four encoding blocks followed by a series of four decoding blocks, which are, in addition, connected by bridges. Encoding blocks are formed by three layers: two convolutional layers (with 3 × 3 kernels) followed by a max pooling layer (a down-sampling operation through the use of maximum filters) with 2 × 2 filters. Similarly, decoding blocks are inversely formed by a 2 × 2 up-sampling layer and two 3 × 3 convolutional layers.
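For illustration only (this numpy sketch is not the actual model code, and nearest-neighbour repetition stands in for the decoder's learned up-sampling), the two resampling operations can be shown on a toy field:

```python
import numpy as np

def max_pool_2x2(x):
    """Downsample by keeping the maximum of each non-overlapping 2 x 2 block."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample_2x2(x):
    """Upsample by repeating each value over a 2 x 2 block (nearest neighbour)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 1, 2, 3],
              [4, 5, 6, 7]], dtype=float)
pooled = max_pool_2x2(x)         # 4 x 4 -> 2 x 2
restored = upsample_2x2(pooled)  # 2 x 2 -> 4 x 4
```

Each encoder block thus halves the spatial resolution while the convolutions extract features; the decoder reverses the resolution changes, with bridges reinjecting the fine-scale information lost in the pooling.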
This architecture is usually represented by a U-shaped network, which gives the model its name. In its original design, UNET was used for image segmentation, which is a classification problem. In this case, UNET tackles a regression problem, because temperature is a continuous numerical variable. Thus, the original UNET has been modified in two main ways: (1) the output layer uses a Rectified Linear Unit function (ReLU, more appropriate for regression tasks than the original sigmoid) and (2) the loss function and metric used are the root mean squared error (instead of the original binary cross entropy and accuracy, respectively). An Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.005 has been used, with 100 epochs and a batch size of 32. Overfitting has been handled by the use of early stopping (stopping the gradient descent once no more improvement is found in a validation set after k iterations, with k typically called the patience), with a patience of 15. For a more detailed description, see Doury et al. (2023). This emulator based on UNET is compared to simpler benchmarks: an emulator consisting of a bilinear interpolation (INT) and another consisting of a multiple linear regression (MLR). For the MLR, predictors are taken from the four nearest neighbors and interpolated to each target point.
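The early-stopping rule described above can be sketched as follows (a hedged illustration with a synthetic validation curve, not the authors' training loop):

```python
def early_stopping(val_losses, patience=15):
    """Scan a validation-loss curve; stop once `patience` consecutive epochs
    bring no improvement. Returns the best epoch and its loss."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # no improvement in `patience` epochs: stop training
    return best_epoch, best_loss

# Synthetic curve: improves for 10 epochs, then plateaus.
losses = [1.0 - 0.05 * e for e in range(10)] + [0.6] * 30
best_epoch, best = early_stopping(losses, patience=15)
```

With a patience of 15, training halts 15 epochs after the plateau begins, and the weights from the best epoch (here, epoch 9) would be retained.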
The evaluation metrics (daily RMSE, variance ratio, temporal correlation and Wasserstein distance; see Figure 1) are calculated at the grid-point scale. When presented in the form of maps, they are accompanied by their means (M) and their super-quantiles of order 0.05 (SQ05) and 0.95 (SQ95), in order to summarize the maps in a few values. The super-quantile of order α is defined as the mean of all the values larger (respectively, smaller) than the quantile of order α.

| RESULTS

| Perfect model framework
In the first experiment, the emulator is trained under RCP8.5 and evaluated under RCP4.5. First, the strengths of the emulator are shown. Figures 1 and 2 show scores for the UNET emulator compared to the interpolation and the linear method in the Perfect Model Framework (trained and evaluated for the UPRCM). In this framework, daily scores by UNET are systematically better than those by the other methods (Figure 1). But when UNET is applied to the other ESMs (even to CNRM-CM5, the one used to drive the RCM in the training dataset), its RMSEs increase up to around 2–2.5°C, becoming similar to those by MLR (Figure 3). Thus, despite the good performance of UNET when applied to the same ESM it has been trained with (the UPRCM in this case, but similar results are reached with CNRM-CM5, not shown), when applied to other ESMs its performance is considerably worse.

| Second experiment: Extrapolation
For the second experiment, the one focusing on extrapolation, we present results for the present and future climatologies as well as for the delta changes, for the mean temperature and for the 99th percentile. Figure 4 shows that the mean climatology and its delta change are well captured by UNET and MLR but, as mentioned, the analysis of averaged aspects alone (either temporally or spatially) can hide imperfections at the finer scales. When analyzing the 99th percentile, UNET presents low biases (around −1°C to −0.5°C) in the present climatology (inside its calibration range), but for the future climatology these extreme values present significant biases (mostly between −0.5°C and 2°C, but up to 5°C in some cases), larger than those given by the linear method (and, as a consequence, also for the delta change).

| DISCUSSION AND CONCLUSIONS
Deep learning methods based on CNNs are state-of-the-art downscaling methods, being used both for statistical downscaling under Perfect Prognosis and for hybrid approaches, such as RCM emulators. In this study, we have identified two important limitations for both applications: (1) RCM emulators appear to be ESM dependent (i.e., their performance is considerably different when evaluated for ESMs different from the one used during their calibration), and (2) their performance under extrapolation (values outside of the calibration range) is poor. For this study we have used the popular UNET, a specific implementation that has shown great performance in many fields, including climate downscaling. The first limitation is not exclusive to CNNs, but is also seen in linear methods, which indicates that the predictor/predictand relationships established by the RCM are different for different driving ESMs. A possible explanation for this finding could be related to some overfitting by the emulators. Another possible explanation could be that the set of predictors used is not enough to explain all the predictand variability, and predictors containing important information might be missing (e.g., aerosols, clouds, radiation, surface processes, or other variables dependent on each ESM's parameterizations). In particular, near-surface temperature is also highly dependent on soil water content, which in turn is responsible for the partition between the sensible and latent surface heat fluxes affecting near-surface temperature and humidity; both heat fluxes are the main mechanism returning energy from the land surface to the atmosphere. Consequently, the Bowen ratio (the ratio of sensible to latent heat flux), not usually contemplated among the atmospheric predictors, would help to further explain the temperature variability (Rodríguez-Camino & Avissar, 1998). Further investigation in this direction might be fruitful in order to build, if possible, emulators capable of generalizing to different ESMs. These results point to only a moderate potential benefit from the use of RCM emulators. For a large ensemble composed of multiple emission scenarios (N) and multiple driving ESMs (M) for a single RCM, it does not seem feasible to fill the scenario/ESM matrix with emulators trained on only a few RCM simulations. Had results been similar for other ESMs, only two RCM simulations (a historical one and a high-end one) would have been enough to fill the N × M matrix. Instead, 2 × M RCM simulations are needed (historical + high-end, for each driving ESM) for emulators to produce accurate results.
As for the second limitation, although extrapolation is a well-known potential issue for any statistical method, errors under extrapolation at the fine spatial and temporal scales are not often assessed when evaluating new methods. In this study, we have shown how CNNs trained in the present can accurately reproduce the mean climatology both in the present and under climate change (averaged aspects), but for extreme temperatures (the tail of the distribution) important errors emerge in the future climate (outside of the calibration range). This is rarely assessed, and such errors can often remain hidden if only averaged aspects are evaluated. Nonetheless, these errors can lead to wrong conclusions in impact and adaptation studies. For variables such as temperature, where the predictor/predictand relationships are quite linear and the signal of change is strong (a large amount of data projected outside of the calibration range), simple linear methods might be more suitable than sophisticated deep learning techniques.
Needless to say, these experiments have been carried out for a particular RCM and configuration, and an expansion to other RCMs/configurations would lead to more robust conclusions. Similarly, these conclusions have been reached using a particular deep learning approach and a set of evaluation metrics, but others are possible and might lead to different conclusions.

FIGURE 1 Daily RMSE (°C), variance ratio (%), temporal correlation and Wasserstein distance (in columns) by INT, MLR and UNET (in rows) for the complete period (2006–2100). The models have been trained and evaluated in the Perfect Model Framework (trained with the UPRCM driven by CNRM-CM5 under RCP8.5 and evaluated with the UPRCM driven by CNRM-CM5 under RCP4.5).

FIGURE 2 RCM truth and bias by INT, MLR and UNET (in columns) for the mean temperature (°C; present climatology in the first row and future delta change in the second row) and for the 99th percentile (°C; present climatology in the third row and future delta change in the fourth row). The present climatology corresponds to 2006–2025 and the future delta change to the difference between 2081–2100 and 2006–2025. The models have been trained and evaluated in the Perfect Model Framework (trained with the UPRCM driven by CNRM-CM5 under RCP8.5 and evaluated with the UPRCM driven by CNRM-CM5 under RCP4.5).

FIGURE 3 Daily RMSE (°C) for the complete period (2006–2100). The models have been trained in the Perfect Model Framework (UPRCM driven by CNRM-CM5 under RCP8.5) and evaluated over the UPRCM driven by CNRM-CM5, CNRM-CM5, HadGEM2-ES, IPSL-CM5A-MR and NorESM1-M (from left to right) under RCP4.5. Each box (MLR in blue and UNET in orange) summarizes the distribution of the 2345 grid points by the median and the quartiles; whiskers extend to a maximum of 1.5 times the interquartile range.

FIGURE 4 Bias in the mean temperature (°C, first row) and the 99th percentile (°C, second row) in the present climatology (first column, 2006–2025), the future climatology (second column, 2081–2100) and the delta change (third column, difference between 2081–2100 and 2006–2025). The models have been trained and evaluated in the Perfect Model Framework (UPRCM driven by CNRM-CM5 under RCP4.5 for training and RCP8.5 for evaluation). Each box (MLR in blue and UNET in orange) summarizes the distribution of the 2345 grid points by the median and the quartiles; whiskers extend to a maximum of 1.5 times the interquartile range.
TABLE 1 Regional Climate Model and Earth System Models used. CNRM-CM5 is the ESM used to drive the RCA4 RCM during the training.